Preface



Molecular Computing is a fast-emerging area of Natural Computing. On the one 
hand, it is concerned with the use of (bio)molecules for the purpose of computing 
while on the other hand it tries to understand the computational nature of the 
molecular processes going on in living cells. 

The paper “Molecular computation of solutions to combinatorial problems” 
by L. Adleman, which describes a laboratory experiment and was published in 
Science in November 1994, was an important milestone for the area of molec- 
ular computing, as it provided a “proof-of-principle” that one can indeed per- 
form computations in a biolab using biomolecules and their processing using 
biomolecular operations. However, research concerning the computational na- 
ture of biomolecular operations dates back to before 1994. In particular, a pio- 
neering work concerning the mathematical theory of biooperations is the paper 
“Formal language theory and DNA: an analysis of the generative capacity of spe- 
cific recombinant behaviors,” authored by Tom Head in 1987, which appeared 
in the Bulletin of Mathematical Biology. The paper uses the framework of for- 
mal language theory to formulate and investigate the computational effects of 
biomolecular operations carried out by restriction enzymes. This paper has in- 
fluenced research in both molecular computing and formal language theory. In 
molecular computing it has led to a clear computational understanding of im- 
portant biomolecular operations occurring in nature, and it has also stimulated 
the design of a number of laboratory experiments utilizing Tom’s ideas for the 
purpose of human-designed DNA computing. In formal language theory it has 
led to a novel, interesting and challenging research area, originally called “splic- 
ing systems” and then renamed “H systems” in honor of Tom (“H” stands for 
“Head”). Many papers stimulated by the pioneering ideas presented by Tom in 
his seminal paper were written by researchers from all over the world. 

Adleman’s paper was a great event in Tom’s life: it has confirmed his con- 
viction that biooperations can be used for the purpose of computing, but more 
importantly it has stimulated his interest in experimental research. One can 
safely say that since 1994 most of Tom’s research has been focused on the de- 
sign of experiments related to DNA computing. Also on this research path he 
remained highly innovative and original, combining his great talent for modeling 
with a passion for experimental biology. A good manifestation of this line of
Tom’s research is aqueous computing - a really elegant but also experimentally 
feasible model of molecular computing invented by him. 

An example of the recognition of Tom’s research within the molecular com- 
puting community is the “DNA Scientist of the Year” award that Tom received 
in 2003. 

Tom’s multidisciplinary talents and interests become even more evident when 
one realizes that his original training and passion was mathematics, and in
particular algebra. He moved from there to formal language theory. It is also
important to keep in mind that his work on formal models for biology started long
before his 1987 paper, as he was a very active and productive researcher in the 
area of Lindenmayer systems that model the development of simple multicellular 
organisms, not on the molecular but rather on the cellular level. 

With this volume, presenting many aspects of research in (or stimulated 
by) molecular computing, we celebrate a scientist who has been a source of 
inspiration to many researchers, and to us a mentor, a scientific collaborator, 
and a warm and caring friend. 

HAPPY BIRTHDAY, Tom! 

November 2003 Natasa Jonoska 

Gheorghe Paun 
Grzegorz Rozenberg 




Table of Contents 



Solving Graph Problems by P Systems with Restricted Elementary
Active Membranes
Artiom Alhazov, Carlos Martin-Vide, Linqiang Pan

Writing Information into DNA
Masanori Arita

Balance Machines: Computing = Balancing
Joshua J. Arulanandham, Cristian S. Calude, Michael J. Dinneen

Eilenberg P Systems with Symbol-Objects
Francesco Bernardini, Marian Gheorghe, Mike Holcombe

Molecular Tiling and DNA Self-assembly
Alessandra Carbone, Nadrian C. Seeman

On Some Classes of Splicing Languages
Rodica Ceterchi, Carlos Martin-Vide, K.G. Subramanian

The Power of Networks of Watson-Crick D0L Systems
Erzsebet Csuhaj-Varju, Arto Salomaa

Fixed Point Approach to Commutation of Languages
Karel Culik II, Juhani Karhumaki, Petri Salmela

Remarks on Relativisations and DNA Encodings
Claudio Ferretti, Giancarlo Mauri

Splicing Test Tube Systems and Their Relation to Splicing
Membrane Systems
Franziska Freund, Rudolf Freund, Marion Oswald

Digital Information Encoding on DNA
Max H. Garzon, Kiranchand V. Bobba, Bryan P. Hyde

DNA-based Cryptography
Ashish Gehani, Thomas LaBean, John Reif

Splicing to the Limit
Elizabeth Goode, Dennis Pixton







Formal Properties of Gene Assembly: Equivalence Problem for
Overlap Graphs
Tero Harju, Ion Petre, Grzegorz Rozenberg

n-Insertion on Languages
Masami Ito, Ryo Sugiura

Transducers with Programmable Input by DNA Self-assembly
Natasa Jonoska, Shiping Liao, Nadrian C. Seeman

Methods for Constructing Coded DNA Languages
Natasa Jonoska, Kalpana Mahalingam

On the Universality of P Systems with Minimal Symport/Antiport
Rules
Lila Kari, Carlos Martin-Vide, Andrei Paun

An Algorithm for Testing Structure Freeness of Biomolecular
Sequences
Satoshi Kobayashi, Takashi Yokomori, Yasubumi Sakakibara

On Languages of Cyclic Words
Manfred Kudlek

A DNA Algorithm for the Hamiltonian Path Problem Using
Microfluidic Systems
Lucas Ledesma, Juan Pazos, Alfonso Rodriguez-Paton

Formal Languages Arising from Gene Repeated Duplication
Peter Leupold, Victor Mitrana, Jose M. Sempere

A Proof of Regularity for Finite Splicing
Vincenzo Manca

The Duality of Patterning in Molecular Genetics
Solomon Marcus

Membrane Computing: Some Non-standard Ideas
Gheorghe Paun

The P Versus NP Problem Through Cellular Computing with
Membranes
Mario J. Perez-Jimenez, Alvaro Romero-Jimenez,
Fernando Sancho-Caparrini

Realizing Switching Functions Using Peptide-Antibody Interactions
M. Sakthi Balan, Kamala Krithivasan





Plasmids to Solve #3SAT
Rani Siromoney, Bireswar Das

Communicating Distributed H Systems with Alternating Filters
Sergey Verlan

Publications by Thomas J. Head

Author Index




Solving Graph Problems by P Systems with 
Restricted Elementary Active Membranes 



Artiom Alhazov^{1,2}, Carlos Martin-Vide^{2}, and Linqiang Pan^{2,3}

^1 Institute of Mathematics and Computer Science
Academy of Science of Moldova
Str. Academiei 5, Chisinau, MD 2028, Moldova
artiom@math.md

^2 Research Group on Mathematical Linguistics
Rovira i Virgili University
Pl. Imperial Tarraco 1, 43005 Tarragona, Spain
aa2.doc@estudiants.urv.es,
cmv@astor.urv.es, lp@fll.urv.es

^3 Department of Control Science and Engineering
Huazhong University of Science and Technology
Wuhan 430074, Hubei, People's Republic of China
lqpan@mail.hust.edu.cn



Abstract. P systems are parallel molecular computing models based on 
processing multisets of objects in cell-like membrane structures. In this 
paper we give membrane algorithms to solve the vertex cover problem 
and the clique problem in linear time with respect to the number of 
vertices and edges of the graph by recognizing P systems with active 
membranes using 2-division. Also, the linear time solution of the vertex 
cover problem is given using P systems with active membranes using 
2-division and linear resources.



1 Introduction 

P systems are a class of distributed parallel computing devices of a
biochemical type, introduced in [2], which can be seen as a general computing
architecture where various types of objects can be processed by various
operations. The model comes from the observation that certain processes which
take place in the complex structure of living organisms can be considered as
computations. For a motivation and detailed description of various P system
models we refer the reader to [2], [4].

In [3] Paun considered P systems where the number of membranes increases 
during a computation by using some division rules, which are called P systems 
with active membranes. These systems model the natural division of cells. 

In [6] Perez-Jimenez et al. solve the satisfiability problem in linear time with
respect to the number of variables and clauses of the propositional formula by
recognizing P systems with active membranes using 2-division. Since the vertex
cover problem and the clique problem belong to the class NP, they can also
be solved in polynomial time by recognizing P systems with active membranes
using 2-division: one reduces these problems to the satisfiability problem and
then applies the P systems which solve the satisfiability problem in linear
time. In this paper, we do not apply the cumbersome and time-consuming
(polynomial time) reduction; instead we directly give membrane algorithms to
solve the vertex cover problem and the clique problem in linear time with
respect to the number of vertices and edges of the graph by recognizing P
systems with restricted elementary active membranes. In this solution, the
structure of the P systems is uniform, but we need polynomial resources to
describe the P systems. Using linear resources to describe the P systems, we
also give a membrane algorithm to solve the vertex cover problem in linear time
by semi-uniform P systems.

The paper is organized as follows: in Section 2 we introduce P systems with
active membranes; in Section 3 we define the notion of recognizing P system;
in Section 4 the polynomial complexity class in computing with membranes,
$\mathrm{PMC}_{\mathcal{F}}$, is recalled. Section 5 gives a membrane algorithm
to solve the vertex cover problem in linear time with respect to the number of
vertices and edges of the graph by recognizing P systems with restricted
elementary active membranes, and gives a formal verification. In Section 6, we
give a membrane solution of the vertex cover problem by semi-uniform P systems
with restricted elementary active membranes and linear resources. In Section 7,
the clique problem is solved in linear time with respect to the number of
vertices and edges of the graph by recognizing P systems with restricted
elementary active membranes. Finally, Section 8 contains some discussion.

2 P Systems with Active Membranes 

In this section we describe P systems with active membranes due to [3], where 
more details can also be found. 

A membrane structure is represented by a Venn diagram and is identified 
by a string of correctly matching parentheses, with a unique external pair of 
parentheses; this external pair of parentheses corresponds to the external mem- 
brane, called the skin. A membrane without any other membrane inside is said 
to be elementary. For instance, the structure in Figure 1 contains 8 membranes; 
membranes 3, 5, 6 and 8 are elementary. The string of parentheses identifying 
this structure is 

$[_1 [_2 [_5\, ]_5 [_6\, ]_6 ]_2 [_3\, ]_3 [_4 [_7 [_8\, ]_8 ]_7 ]_4 ]_1.$

All membranes are labeled; here we have used the numbers from 1 to 8. We say 
that the number of membranes is the degree of the membrane structure, while 
the height of the tree associated in the usual way with the structure is its depth. 
In the example above we have a membrane structure of degree 8 and of depth 4. 
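
The degree, depth and elementary membranes quoted for this example can be checked mechanically. The following is a minimal sketch, assuming the structure is written as a plain string in which each bracket is followed by its label and every label occurs exactly once (the general definition allows repeated labels, which this sketch does not handle); the names `parse_membranes`, `depth` and `MU` are illustrative only.

```python
import re

def parse_membranes(s):
    """Parse a labelled bracket string such as '[1 [2 ]2 ]1'.

    Returns (parent, children): parent maps each label to the label of the
    membrane directly containing it (None for the skin), children maps each
    label to the list of membranes placed immediately inside it.
    """
    parent, children, stack = {}, {}, []
    for kind, text in re.findall(r"([\[\]])\s*(\d+)", s):
        label = int(text)
        if kind == "[":
            parent[label] = stack[-1] if stack else None
            children[label] = []
            if stack:
                children[stack[-1]].append(label)
            stack.append(label)
        elif not stack or stack.pop() != label:
            raise ValueError("mismatched membrane label %d" % label)
    if stack:
        raise ValueError("unclosed membranes: %r" % stack)
    return parent, children

def depth(children, root):
    """Height of the tree rooted at `root`, counting the root as level 1."""
    return 1 + max((depth(children, c) for c in children[root]), default=0)

# The structure of Figure 1, written with plain labels (an assumed encoding).
MU = "[1 [2 [5 ]5 [6 ]6 ]2 [3 ]3 [4 [7 [8 ]8 ]7 ]4 ]1"
parent, children = parse_membranes(MU)
degree = len(parent)
elementary = sorted(label for label, inner in children.items() if not inner)
structure_depth = depth(children, 1)
```

Here `degree` counts the membranes, `elementary` lists the membranes with nothing inside, and `structure_depth` is the height of the associated tree.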

In what follows, the membranes can be marked with $+$ or $-$, and this is
interpreted as an "electrical charge", or with 0, and this means "neutral charge".
We will write $[\ ]^{+}$, $[\ ]^{-}$, $[\ ]^{0}$ in the three cases, respectively.

The membranes delimit regions, precisely identified by the membranes (the
region of a membrane is delimited by the membrane and all membranes placed
immediately inside it, if any such membrane exists). In these regions we place
objects, which are represented by symbols of an alphabet. Several copies of the
same object can be present in a region, so we work with multisets of objects. A
multiset over an alphabet $V$ is represented by a string over $V$: the number of
occurrences of a symbol $a \in V$ in a string $x \in V^*$ ($V^*$ is the set of
all strings over $V$; the empty string is denoted by $\lambda$) is denoted by
$|x|_a$ and it represents the multiplicity of the object $a$ in the multiset
represented by $x$.
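
As a small illustration of the string representation of multisets, the sketch below uses Python's `collections.Counter`; it assumes every object is a single character, and the helper name `multiset` is illustrative only.

```python
from collections import Counter

def multiset(x):
    """Multiset represented by the string x: maps each symbol a to |x|_a."""
    return Counter(x)

w = multiset("aabac")
count_a = w["a"]          # |x|_a for x = "aabac"
count_d = w["d"]          # symbols that do not occur have multiplicity 0
same = multiset("aab") == multiset("aba")   # the order of symbols is irrelevant
empty = multiset("")      # the empty string represents the empty multiset
```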







Figure 1. A membrane structure and its associated tree 



A P system with active membranes and 2-division is a construct

$\Pi = (O, H, \mu, w_1, \ldots, w_m, R)$,

where:

(i) $m \ge 1$ (the initial degree of the system);

(ii) $O$ is the alphabet of objects;

(iii) $H$ is a finite set of labels for membranes;

(iv) $\mu$ is a membrane structure, consisting of $m$ membranes, labelled (not
necessarily in a one-to-one manner) with elements of $H$;

(v) $w_1, \ldots, w_m$ are strings over $O$, describing the multisets of objects
placed in the $m$ regions of $\mu$;

(vi) $R$ is a finite set of developmental rules, of the following forms:

(a) $[_h\, a \to v\, ]_h^{\alpha}$,

for $h \in H$, $\alpha \in \{+, -, 0\}$, $a \in O$, $v \in O^*$

(object evolution rules, associated with membranes and depending on
the label and the charge of the membranes, but not directly involving
the membranes, in the sense that the membranes are neither taking
part in the application of these rules nor are they modified by them);

(b) $a\, [_h\ ]_h^{\alpha_1} \to [_h\, b\, ]_h^{\alpha_2}$,

for $h \in H$, $\alpha_1, \alpha_2 \in \{+, -, 0\}$, $a, b \in O$

(communication rules; an object is introduced in the membrane, possibly
modified during this process; also the polarization of the membrane
can be modified, but not its label);

(c) $[_h\, a\, ]_h^{\alpha_1} \to [_h\ ]_h^{\alpha_2}\, b$,

for $h \in H$, $\alpha_1, \alpha_2 \in \{+, -, 0\}$, $a, b \in O$

(communication rules; an object is sent out of the membrane, possibly
modified during this process; also the polarization of the membrane can
be modified, but not its label);

(d) $[_h\, a\, ]_h^{\alpha} \to b$,

for $h \in H$, $\alpha \in \{+, -, 0\}$, $a, b \in O$

(dissolving rules; in reaction with an object, a membrane can be
dissolved, while the object specified in the rule can be modified);

(e) $[_h\, a\, ]_h^{\alpha_1} \to [_h\, b\, ]_h^{\alpha_2}\, [_h\, c\, ]_h^{\alpha_3}$,

for $h \in H$, $\alpha_1, \alpha_2, \alpha_3 \in \{+, -, 0\}$, $a, b, c \in O$

(division rules for elementary membranes; in reaction with an object,
the membrane is divided into two membranes with the same label,
possibly of different polarizations; the object specified in the rule is
replaced in the two new membranes by possibly new objects);

(f) $[_{h_0} [_{h_1}\ ]_{h_1}^{\alpha_1} \cdots [_{h_k}\ ]_{h_k}^{\alpha_1} [_{h_{k+1}}\ ]_{h_{k+1}}^{\alpha_2} \cdots [_{h_n}\ ]_{h_n}^{\alpha_2} ]_{h_0}^{\alpha_0}$

$\to [_{h_0} [_{h_1}\ ]_{h_1}^{\alpha_3} \cdots [_{h_k}\ ]_{h_k}^{\alpha_3} ]_{h_0}^{\alpha_5}\, [_{h_0} [_{h_{k+1}}\ ]_{h_{k+1}}^{\alpha_4} \cdots [_{h_n}\ ]_{h_n}^{\alpha_4} ]_{h_0}^{\alpha_6}$,

for $k \ge 1$, $n > k$, $h_i \in H$, $0 \le i \le n$, and
$\alpha_0, \ldots, \alpha_6 \in \{+, -, 0\}$ with
$\{\alpha_1, \alpha_2\} = \{+, -\}$; if the membrane with the label $h_0$
contains other membranes than those with the labels $h_1, \ldots, h_n$
specified above, then they must have neutral charges in order to make this
rule applicable; these membranes are duplicated and then are part of the
contents of both new copies of the membrane $h_0$

(division of non-elementary membranes; this is possible only if a
membrane contains two immediately lower membranes of opposite
polarization, $+$ and $-$; the membranes of opposite polarizations are
separated in the two new membranes, but their polarization can change; always,
all membranes of opposite polarizations are separated by applying this
rule).

For a detailed description of using these rules we refer to [3]. Here we only 
mention that all the above rules are applied in parallel, but at one step, a mem- 
brane h can be subject of only one rule of types (b)-(f). 

A P system with restricted active membranes is a P system with active mem- 
branes where the rules are of types (a), (b), (c), (e), and (f) only (i.e., a P system 
with active membranes not using membrane dissolving rules of type (d)). 

A P system with restricted elementary active membranes is a P system with 
active membranes where the rules are of types (a), (b), (c), and (e) only (i.e., a 
P system with active membranes not using membrane dissolving rules of type 
(d) and the (f) type rules for non-elementary membrane division). 
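
The two restricted variants are defined purely by which rule types occur. A minimal sketch of that classification, assuming a rule set is summarized by the letters (a) to (f) of the rule types it uses (the function name `system_class` is illustrative only):

```python
ALL_TYPES = frozenset("abcdef")
RESTRICTED = frozenset("abcef")             # no dissolving rules (d)
RESTRICTED_ELEMENTARY = frozenset("abce")   # additionally no rules of type (f)

def system_class(rule_types):
    """Classify a P system with active membranes by the rule types it uses."""
    used = set(rule_types)
    if not used <= ALL_TYPES:
        raise ValueError("unknown rule types: %r" % sorted(used - ALL_TYPES))
    if used <= RESTRICTED_ELEMENTARY:
        return "restricted elementary"
    if used <= RESTRICTED:
        return "restricted"
    return "general"

example_1 = system_class("aabce")   # evolution, communication, elementary division
example_2 = system_class("abf")     # uses non-elementary division
example_3 = system_class("ad")      # uses dissolution
```

Every restricted elementary system is also a restricted system; the function reports the most specific class.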

3 Recognizing P Systems 




In this section, we introduce the notion of recognizing P systems following [5]. 
First of all, we consider P systems with input and without external output. 

Definition 3.1. A P system with input is a tuple $(\Pi, \Sigma, i_\Pi)$, where:

- $\Pi$ is a P system, with working alphabet $\Gamma$ and initial multisets
$w_1, \ldots, w_p$ (associated with membranes labelled by $1, \ldots, p$,
respectively).

- $\Sigma$ is an (input) alphabet strictly contained in $\Gamma$.

- $w_1, \ldots, w_p$ are multisets over $\Gamma - \Sigma$.

- $i_\Pi$ is the label of a distinguished membrane (of input).

If $w \in M(\Sigma)$ is a multiset over $\Sigma$, then the initial configuration
of $(\Pi, \Sigma, i_\Pi)$ with input $w$ is $(\mu_0, M_0)$, where $\mu_0 = \mu$,
$M_0(j) = w_j$ for each $j \ne i_\Pi$, and $M_0(i_\Pi) = w_{i_\Pi} \cup w$.

The computations of a P system with input $w \in M(\Sigma)$ are defined in a
natural way. Note that the initial configuration must be the initial configuration
of the system associated with the input multiset $w \in M(\Sigma)$.

In the case of P systems with input and with external output (where we can
imagine that the internal processes are unknown, and we only obtain the
information that the system sends out to the environment), the concept of
computation is introduced in a similar way but with a slight variant. In the
configurations, we will not work directly with the membrane structure $\mu$ but
with another structure associated with it including, in some sense, the
environment.

Definition 3.2. Let $\mu = (V(\mu), E(\mu))$ be a membrane structure. The
membrane structure with environment associated with $\mu$ is the rooted tree
$Ext(\mu)$ such that: (a) the root of the tree is a new node that we will denote
$env$; (b) the set of nodes is $V(\mu) \cup \{env\}$; (c) the set of edges is
$E(\mu) \cup \{\{env, skin\}\}$. The node $env$ is called the environment of the
structure $\mu$.
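
The construction of the membrane structure with environment can be sketched as follows, assuming the membrane structure is given as a set of node labels and a set of unordered edges (each a `frozenset` of two nodes); the function name `ext` and the string `"env"` for the new root are illustrative choices.

```python
def ext(nodes, edges, skin):
    """Add an environment node above the skin; the original tree is unchanged."""
    if "env" in nodes:
        raise ValueError("'env' is reserved for the environment node")
    new_nodes = set(nodes) | {"env"}
    new_edges = set(edges) | {frozenset(("env", skin))}
    return new_nodes, new_edges

# A skin membrane 1 containing one elementary membrane 2.
nodes = {1, 2}
edges = {frozenset((1, 2))}
ext_nodes, ext_edges = ext(nodes, edges, skin=1)
```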

Note that we have only included a new node representing the environment 
which is only connected with the skin, while the original membrane structure 
remains unchanged. In this way, every configuration of the system informs about 
the environment and its content. 

Now we introduce recognizing P systems as devices able to accept or reject 
multisets considered as input. 

Definition 3.3. A recognizing P system is a P system with input,
$(\Pi, \Sigma, i_\Pi)$, and with external output such that:

1. The working alphabet contains two distinguished elements yes, no.

2. All computations of the system halt.

3. If $C$ is a computation of $\Pi$, then either the object yes or the object no
(but not both) has to be sent out to the environment, and only in the last step
of the computation.

Definition 3.4. We say that $C$ is an accepting computation (respectively,
rejecting computation) if the object yes (respectively, no) appears in the
environment associated with the corresponding halting configuration of $C$.

4 The Polynomial Complexity Class $\mathrm{PMC}_{\mathcal{F}}$

Many practical problems are presumably intractable for conventional (electronic) 
computers, because all known algorithms solving these problems spend expo- 
nential time. P systems have an inherent parallelism and hence the capability to 
solve hard problems in feasible (polynomial or linear) time. 

To understand what it means that a problem can be solved in polynomial 
time by membrane systems, it is necessary to recall some complexity measure 
for P systems as described in [5]. 

A decision problem will be solved by a family of recognizing P systems in such 
a way that given an instance of the problem it is necessary to fix the concrete P 
system (with a suitable input multiset) that will process it. The next definition 
{polynomial encoding) captures this idea. 

Definition 4.1. Let $L$ be a language, $\mathcal{F}$ a class of P systems with
input and $\Pi = (\Pi(t))_{t \in \mathbb{N}}$ a family of P systems of type
$\mathcal{F}$. A polynomial encoding of $L$ in $\Pi$ is a pair $(g, h)$ of
polynomial time computable functions, $g : L \to \bigcup_{t \in \mathbb{N}} I_{\Pi(t)}$
and $h : L \to \mathbb{N}$, such that for every $u \in L$ we have
$g(u) \in I_{\Pi(h(u))}$.

Now we define what it means to solve a decision problem by a family of
recognizing P systems in time bounded by a given function.

Definition 4.2. Let $\mathcal{F}$ be a class of recognizing P systems,
$f : \mathbb{N} \to \mathbb{N}$ a total computable function, and
$X = (I_X, \theta_X)$ a decision problem. We say that $X$ belongs to
$\mathrm{MC}_{\mathcal{F}}(f)$ if there exists a family,
$\Pi = (\Pi(t))_{t \in \mathbb{N}}$, of P systems such that:

- $\Pi$ is $\mathcal{F}$-consistent: that is, $\Pi(t) \in \mathcal{F}$ for all
$t \in \mathbb{N}$.

- $\Pi$ is polynomially uniform: that is, there exists a deterministic Turing
machine that constructs $\Pi(t)$ in polynomial time from $t \in \mathbb{N}$.

- There exists a polynomial encoding $(g, h)$ from $I_X$ to $\Pi$ verifying:

• $\Pi$ is $f$-bounded, with regard to $(g, h)$; that is, for each $u \in I_X$,
all computations of $\Pi(h(u))$ with input $g(u)$ halt in at most $f(|u|)$ steps.

• $\Pi$ is $X$-sound, with regard to $(g, h)$; that is, for each $u \in I_X$, if
there exists an accepting computation of $\Pi(h(u))$ with input $g(u)$, then
$\theta_X(u) = 1$.

• $\Pi$ is $X$-complete, with regard to $(g, h)$; that is, for each $u \in I_X$,
if $\theta_X(u) = 1$, then every computation of $\Pi(h(u))$ with input $g(u)$ is
an accepting computation.

A polynomial encoding $(g, h)$ from $I_X$ to $\Pi$ provides a size function, $h$,
that gives us the set of instances of $X$ processed through the same P system,
and an input function, $g$, supplying the input multiset to be processed for the
P system.

Note 1. In the above definition we have imposed a confluence property in
the following sense: for every input $u \in I_X$, either every computation of
$\Pi(h(u))$ with input $g(u)$ is an accepting computation or every computation
of $\Pi(h(u))$ with input $g(u)$ is a rejecting computation.







Definition 4.3. The polynomial complexity class associated with a collection of
recognizing P systems, $\mathcal{F}$, is defined as follows:

$\mathrm{PMC}_{\mathcal{F}} = \bigcup_{f \text{ polynomial}} \mathrm{MC}_{\mathcal{F}}(f).$

In particular, the union of all classes $\mathrm{MC}_{\mathcal{F}}(f)$, for $f$ a
linear function, is denoted by $\mathrm{LMC}_{\mathcal{F}}$.

Note 2. If a decision problem belongs to $\mathrm{PMC}_{\mathcal{F}}$, then we
say that it is solvable in polynomial time by a family of P systems which are
constructed in polynomial time starting from the size of the instances of the
problem.

We say that a family of P systems $\mathcal{F}$ is polynomially semi-uniform, if
there exists a deterministic Turing machine that constructs a P system in
polynomial time from each instance of the problem [4]. We denote the
corresponding complexity classes by $\mathrm{LMC}^{S}_{\mathcal{F}}$ and
$\mathrm{PMC}^{S}_{\mathcal{F}}$.

5 Solving the Vertex Cover Problem 

Given a graph $G$ with $n$ vertices and $m$ edges, a vertex cover of $G$ is a
subset of the vertices such that every edge of $G$ is incident to at least one
of those vertices; the vertex cover problem (denoted by VCP) asks whether or not
there exists a vertex cover for $G$ of size $k$, where $k$ is a given integer
less than or equal to $n$. The vertex cover problem is NP-complete [1].
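
For reference, the decision problem itself can be stated as a short sequential check. This is a brute-force sketch over all subsets of size exactly $k$, not the membrane algorithm; vertices are assumed to be numbered 1 to n and edges given as pairs, and the function names are illustrative.

```python
from itertools import combinations

def is_vertex_cover(edges, subset):
    """True if every edge has at least one endpoint in subset."""
    return all(u in subset or v in subset for u, v in edges)

def has_vertex_cover(n, edges, k):
    """Decide VCP by trying every subset of {1, ..., n} with exactly k vertices."""
    return any(is_vertex_cover(edges, set(c))
               for c in combinations(range(1, n + 1), k))

triangle = [(1, 2), (2, 3), (1, 3)]
path = [(1, 2), (2, 3)]
triangle_k1 = has_vertex_cover(3, triangle, 1)
triangle_k2 = has_vertex_cover(3, triangle, 2)
path_k1 = has_vertex_cover(3, path, 1)
```

The sequential check examines up to $\binom{n}{k}$ subsets one after another, which is the exponential work that the P systems below distribute over membranes created by division.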

Next we construct a family of recognizing P systems with active membranes
using 2-division solving the vertex cover problem in linear time.

For that, first we consider the (size) function $h$ defined on $I_{VCP}$ by
$h(G) = ((m+n)(m+n+1)/2) + m$, with $G = (V, E)$, where
$V = \{v_1, v_2, \ldots, v_n\}$, $E = \{e_1, e_2, \ldots, e_m\}$. The function
$h$ is polynomial time computable since the function
$\langle m, n \rangle = ((m+n)(m+n+1)/2) + m$ is primitive recursive and
bijective from $\mathbb{N}^2$ onto $\mathbb{N}$. Also, the inverse function of
$h$ is polynomial.
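
The pairing function and its inverse can be written out directly. The sketch below is one way to compute them with integer arithmetic; the names `pair` and `unpair` are illustrative, and the inverse uses the integer square root to recover $m + n$ first.

```python
from math import isqrt

def pair(m, n):
    """<m, n> = ((m + n)(m + n + 1) / 2) + m."""
    return (m + n) * (m + n + 1) // 2 + m

def unpair(t):
    """Inverse of pair: recover (m, n) from t = <m, n>."""
    s = (isqrt(8 * t + 1) - 1) // 2      # s = m + n, the diagonal containing t
    m = t - s * (s + 1) // 2
    return m, s - m

size = pair(2, 3)          # a graph with m = 2 edges and n = 3 vertices
back = unpair(size)

# On the first 20 diagonals the map hits every value 0 .. 209 exactly once.
values = sorted(pair(m, s - m) for s in range(20) for m in range(s + 1))
```

Both directions use a constant number of arithmetic operations and one integer square root.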

For each $(m, n) \in \mathbb{N}^2$ we consider the P system
$(\Pi(\langle m, n \rangle), \Sigma(m, n), i(m, n))$, where
$\Sigma(m, n) = \{e_{x,(i,j)} \mid 1 \le x \le m,\ 1 \le i, j \le n$, the two
vertices of edge $e_x$ are $v_i$ and $v_j\}$, $i(m, n) = 2$ and
$\Pi(\langle m, n \rangle) = (\Gamma(m, n), \{1, 2\}, [_1 [_2\ ]_2 ]_1, w_1, w_2, R)$,
with $\Gamma(m, n)$ defined as follows:

$\Gamma(m, n) = \Sigma(m, n) \cup \{d_i \mid 1 \le i \le 3n + 2m + 3\}$

$\cup\ \{r_{i,j} \mid 0 \le i \le m,\ 1 \le j \le 2n\} \cup \{c_{i,j} \mid 0 \le i \le k,\ 1 \le j \le 4n - 1\}$

$\cup\ \{f_i \mid 1 \le i \le m + 2\} \cup \{t, \lambda, yes, no\}.$

The initial content of each membrane is: $w_1 = \emptyset$ and
$w_2 = \{d_1, c_{0,1}\}$. The set of rules, $R$, is given by (we also give
explanations about the use of these rules during the computations):

1. $[_2\, d_i\, ]_2^{0} \to [_2\, d_i\, ]_2^{+}\, [_2\, d_i\, ]_2^{-}$, $1 \le i \le n$.

By using a rule of type 1, a membrane with label 2 is divided into two 
membranes with the same label, but with different polarizations. These rules 
allow us to duplicate, in one step, the total number of membranes with 
label 2. 

2- [ 2 ^ 0 ;. (i,j) ^ ^<x<m, 2<j<n. 

[ 2 ^ 0 ;. (ij) ^ A]“, 1 < a; < TO, 2 < j < n. 

The rules of type 2 try to implement a process allowing membranes with 
label 2 to encode whether vertex vi appears in a subset of vertices, in such a 
way that if vertex v\ appears in a subset of vertices, then the objects 
encoding edges which are adjacent to vertex v\ will evolve to objects 
in the corresponding membranes with label 2; otherwise, the objects ea,yij) 
will disappear. 

3- [ 2 ^ 0 ;, ^ ^ l<x<m, 2<i<n, i<j<n. 

1 < a; < TO, 2 < z < n, z < j < n. 

The evolving process described previously is always made with respect to
the vertex $v_1$. Hence, the rules of type 3 take charge of making a cyclic path
through all the vertices to get that, initially, the first vertex is $v_1$, then
$v_2$, and so on.

4. $[_2\, c_{i,j} \to c_{i+1,j}\, ]_2^{+}$, $0 \le i \le k - 1$, $1 \le j \le 2n - 1$.
$[_2\, c_{k,j} \to \lambda\, ]_2^{+}$, $1 \le j \le 2n - 1$.

The rules of type 4 supply counters in the membranes with label 2, in such a
way that we increase the first subscript of $c_{i,j}$ when the membrane has not
more than $k$ vertices; when the membrane has more than $k$ vertices, then
the counter-object $c_{k,j}$ will disappear. So when the process of generating
all subsets of vertices is finished, the object $c_{k,j}$ (note the first
subscript is $k$) will appear only in the membrane with label 2 encoding a
subset with cardinality exactly $k$.

5. $[_2\, d_i\, ]_2^{+} \to [_2\ ]_2^{0}\, d_i$, $1 \le i \le n$.

$[_2\, d_i\, ]_2^{-} \to [_2\ ]_2^{0}\, d_i$, $1 \le i \le n$.

$d_i\, [_2\ ]_2^{0} \to [_2\, d_{i+1}\, ]_2^{0}$, $1 \le i \le n - 1$.

The rules of type 5 are used as controllers of the generating process of all
subsets of vertices and the listing of adjacent edges: the objects $d$ are sent
to the membrane with label 1 at the same time the listing of adjacent edges and
the counting of vertices are made, and they come back to the membranes
with label 2 to start the division of these membranes.

6. $[_2\, r_{i,j} \to r_{i,j+1}\, ]_2^{0}$, $1 \le i \le m$, $1 \le j \le 2n - 1$.

The use of objects $r$ in the rules of types 12, 13, and 14 makes it necessary
to perform a "rotation" of these objects. This is the mission of the rules of
type 6.

7. $[_2\, c_{i,j} \to c_{i,j+1}\, ]_2^{0}$, $0 \le i \le k$, $1 \le j \le 2n - 1$.

The second subscript of $c_{i,j}$ (also in the rule of type 9) is used to control
when the process of checking whether a subset of vertices with cardinality $k$
is a vertex cover will start.

8. $[_1\, d_i \to d_{i+1}\, ]_1^{0}$, $n \le i \le 3n - 2$.

Through the counter-objects $d$, the rules of type 8 control the rotation of
the elements $r_{i,j}$ in the membranes with label 2.

9. $[_2\, c_{k,j} \to c_{k,j+1}\, ]_2^{0}$, $2n \le j \le 4n - 3$.

$[_2\, c_{k,4n-2} \to c_{k,4n-1} f_1\, ]_2^{0}$.

When the generating process of all subsets of vertices and the listing of
adjacent edges is finished, the second subscript of all objects $c_{i,j}$
($1 \le i \le k$) is $2n$. After that, the second subscript of only objects
$c_{k,j}$ (note that the first subscript is $k$) will increase. The object
$c_{k,4n-2}$ evolves to $c_{k,4n-1} f_1$. The object $c_{k,4n-1}$ will change the
polarization of the membrane to negative (using the rule of type 10). The object
$f_1$ is a new counter.

10. $[_2\, c_{k,4n-1}\, ]_2^{0} \to [_2\ ]_2^{-}\, c_{k,4n-1}$.

$[_1\, d_{3n-1} \to d_{3n}\, ]_1^{0}$.

The application of these rules will show that the system is ready to check
which edges are covered by the vertices in a subset with cardinality exactly
$k$ encoded by a membrane with label 2.

11. $[_1\, d_i \to d_{i+1}\, ]_1^{0}$, $3n \le i \le 3n + 2m + 2$.

The rules of type 11 supply counters in the membrane with label 1, through
objects $d$, in such a way that if the objects $d_{3n+2m}$ appear, then they show
the end of the checking of vertex cover. The objects $d_i$, with
$3n + 2m + 1 \le i \le 3n + 2m + 3$, will control the final stage of the
computation.

12. $[_2\, r_{1,2n}\, ]_2^{-} \to [_2\ ]_2^{+}\, r_{1,2n}$.

The applicability of the rule of type 12 encodes the fact that the vertices
encoded by a membrane with label 2 cover the edge $e_1$ represented by the
object $r_{1,2n}$, through a change in the sign of its polarization.

13. $[_2\, r_{i,2n} \to r_{i-1,2n}\, ]_2^{+}$, $1 \le i \le m$.

In the copies of membrane 2 with positive polarization, hence only those
where we have found $r_{1,2n}$ in the previous step, we decrease the first
subscripts of all objects $r_{i,2n}$ from those membranes. Thus, if the second
edge was covered by the vertices from a membrane, that is, $r_{2,2n}$ was present
in a membrane, then $r_{2,2n}$ becomes $r_{1,2n}$, hence the rule of type 12 can
be applied again. Note the important fact that passing from $r_{2,2n}$ to
$r_{1,2n}$ is possible only in membranes where we already had $r_{1,2n}$, hence
we check whether the second edge is covered only after knowing that the first
edge was covered.

14. $r_{1,2n}\, [_2\ ]_2^{+} \to [_2\, r_{0,2n}\, ]_2^{-}$.

At the same time as the use of the rule of type 13, the object $r_{1,2n}$ from
the skin membrane returns to membranes with label 2, changed to $r_{0,2n}$ (which
will never evolve again), and returning the polarization of the membrane to
negative (this makes possible the use of the rule of type 12).

15. $[_2\, f_i \to f_{i+1}\, ]_2^{+}$, $1 \le i \le m$.

The presence of objects $f_i$ (with $2 \le i \le m + 1$) in the membranes with
label 2 shows that the corresponding subsets of vertices covering every edge
of $\{e_1, \ldots, e_{i-1}\}$ are being determined.

16. $[_2\, f_{m+1}\, ]_2^{-} \to [_2\ ]_2^{-}\, f_{m+1}$.

The rule of type 16 sends to the skin membrane the objects $f_{m+1}$ appearing
in the membranes with label 2.

17. $[_1\, f_{m+1} \to f_{m+2}\, t\, ]_1^{0}$.

By using the rule of type 17 the objects $f_{m+1}$ in the skin evolve to objects
$f_{m+2}\, t$. The objects $t$ in the skin are produced simultaneously with the
appearance of the objects $d_{3n+2m+2}$ in the skin, and they will show that
there exists some subset of vertices which is a vertex cover with cardinality $k$.

18. $[_1\, t\, ]_1^{0} \to [_1\ ]_1^{+}\, t$.

The rule of type 18 sends out of the system an object $t$, changing the
polarization of the skin to positive; then the objects $t$ remaining in the skin
are not able to evolve. Hence, an object $f_{m+2}$ can exit the skin producing an
object yes. This object is then sent out to the environment through the rule
of type 19, telling us that there exists a vertex cover with cardinality $k$, and
the computation halts.

19. $[_1\, f_{m+2}\, ]_1^{+} \to [_1\ ]_1^{-}\, yes$.

The applicability of the rule of type 19 changes the polarization in the skin
membrane to negative in order that the objects $f_{m+2}$ remaining in it are not
able to continue evolving.

20. $[_1\, d_{3n+2m+3}\, ]_1^{0} \to [_1\ ]_1^{+}\, no$.

By the rule of type 20 the object $d_{3n+2m+3}$ only evolves when the skin has
neutral charge (this is the case when there does not exist any vertex cover
with cardinality $k$). Then the system will evolve sending out to the
environment an object no and changing the polarization of the skin to positive,
in order that objects $d_{3n+2m+3}$ remaining in the skin do not evolve.

From the previous explanation of the use of rules, one can easily see how these 
P systems work. It is easy to prove that the designed P systems are deterministic. 

Now, we prove that the family $\Pi = (\Pi(t))_{t \in \mathbb{N}}$ solves the
vertex cover problem in linear time.

The above description of the evolution rules is computable in a uniform
way. So, the family $\Pi = (\Pi(t))_{t \in \mathbb{N}}$ is polynomially uniform
because:

- The total number of objects is $2nm + 4nk + 9n + 4m - k + 8 \in O(n^3)$.

- The number of membranes is 2. 

- The cardinality of the initial multisets is 2. 

- The total number of evolution rules is $O(n^3)$.

- The maximal length of a rule (the number of symbols necessary to write a 
rule, both its left and right sides, the membranes, and the polarizations of 
membranes involved in the rule) is 13. 

We consider the (input) function $g : I_{VCP} \to \bigcup_{t \in \mathbb{N}} I_{\Pi(t)}$ defined as follows:

$g(G) = \{e_{x,i,j} \mid 1 \le x \le m,\ 1 \le i, j \le n,\ \text{the two vertices of edge } e_x \text{ are } v_i \text{ and } v_j\}$.

Then $g$ is a polynomial time computable function. Moreover, the pair $(g, h)$
is a polynomial encoding from VCP to $\Pi$ since for each $G \in I_{VCP}$ we have

$g(G) \in I_{\Pi(h(G))}$.

We will denote by $C = (C^i)_{0 \le i \le s}$ the computation of the P system $\Pi(h(G))$
with input $g(G)$. That is, $C^i$ is the configuration obtained after $i$ steps of the
computation $C$.

The execution of the P system $\Pi(h(G))$ with input $g(G)$ can be structured
in four stages: a stage of generation of all subsets of vertices and counting the 
cardinality of subsets of vertices; a stage of synchronization; a stage of checking 
whether there is some subset with cardinality k which is a vertex cover; and a 
stage of output. 

The generating and counting stages are controlled by the objects $d_i$, with
$1 \le i \le n$.

- The presence in the skin of one object $d_i$, with $1 \le i \le n$, shows that all
possible subsets associated with $\{v_1, \dots, v_i\}$ have been generated.

- The objects $c_{i,j}$, with $0 \le i \le k$ and $1 \le j \le 2n$, are used to count the number
of vertices in membranes with label 2. When this stage ends, the object $c_{k,j}$
(note the first subscript is $k$) will appear only in the membranes with label 2
encoding a subset with cardinality exactly $k$.

- In this stage, we simultaneously encode in every internal membrane all the 
edges being covered by the subset of vertices represented by the membrane 
(through the objects Vij). 

- The object $d_1$ appears in the skin after the execution of 2 steps. From the
appearance of $d_i$ in the skin to the appearance of $d_{i+1}$, with $1 \le i \le n - 1$,
3 steps are executed.

- This stage ends when the object $d_n$ appears in the skin.

Hence, the total number of steps in the generating and counting stages is
$2 + 3(n - 1) = 3n - 1$.

The synchronization stage has the goal of unifying the second subscripts of
the objects $c_{i,j}$, to make them equal to $2n$.

- This stage starts with the evolution of the object $d_n$ in the skin.

- In every step of this stage the object $d_i$, with $n \le i \le 3n - 1$, in the skin
evolves to $d_{i+1}$.

- In every step of this stage, the second subscript of objects $c_{k,j}$ (note that the
first subscript is $k$) will increase. The object $c_{k,4n-2}$ evolves to $c_{k,4n-1} f_1$.

- This stage ends as soon as the object $d_{3n}$ appears in the skin, that is, at the
moment when the membrane with label 2 encoding a subset with cardinality
$k$ has negative charge (by using the first rule of type 10).

Therefore, the synchronization stage needs a total of 2n steps. 

The checking stage has the goal to determine how many edges are covered in 
the membrane with label 2 encoding the subset of vertices with cardinality k. 
This stage is controlled by the objects $f_i$, with $1 \le i \le m + 1$.

- The presence of an object $f_i$ in a membrane with label 2 shows that the
edges $e_1, \dots, e_{i-1}$ are covered by the subset of vertices represented by that
membrane.

- From every $f_i$ (with $1 \le i \le m$) the object $f_{i+1}$ is obtained in some
membranes after the execution of 2 steps.

- The checking stage ends as soon as the object $d_{3n+2m}$ appears in the skin.

Therefore, the total number of steps of this stage is 2m. 

The output stage starts immediately after the appearance of the object
$d_{3n+2m}$ in the skin and it is controlled by the objects $f_{m+1}$ and $f_{m+2}$.

- To produce the output yes, the object $f_{m+1}$ must have been produced in
some membrane with label 2 of the configuration $C^{5n+2m-1}$. Then, after 4
steps the system returns yes to the environment, through the evolution of
objects $f_{m+2}$ present in the skin, and when it has positive charge.

- To produce the output no, no object $f_{m+1}$ appears in any membrane with
label 2 of the configuration $C^{5n+2m-1}$. Then after 4 steps the system returns
no to the environment, through the evolution of objects $d_{3n+2m+3}$ present
in the skin, and when it has neutral charge.

Therefore, the total number of steps in the output stage is 4. 

Let us see that the family $\Pi$ is linearly bounded, with regard to $(g, h)$. For
that, it is enough to note that the time of the stages of the execution of $\Pi(h(G))$
with input $g(G)$ is: (a) generation stage, $3n - 1$ steps; (b) synchronization stage,
$2n$ steps; (c) checking stage, $2m$ steps; and (d) output stage, 4 steps. Hence, the
total execution time of $\Pi(h(G))$ with input $g(G)$ is $5n + 2m + 3 \in O(n + m)$.
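The stage counts above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the construction itself; the function name `uniform_vcp_steps` and the dictionary keys are illustrative assumptions.

```python
def uniform_vcp_steps(n: int, m: int) -> dict:
    """Step counts of the four stages for a graph with n vertices and m edges.

    The per-stage values are taken from the text; the names are hypothetical.
    """
    stages = {
        "generation_and_counting": 3 * n - 1,
        "synchronization": 2 * n,
        "checking": 2 * m,
        "output": 4,
    }
    # The total should agree with the closed form 5n + 2m + 3.
    stages["total"] = sum(stages.values())
    return stages


if __name__ == "__main__":
    print(uniform_vcp_steps(5, 5))
```

For any $n$ and $m$ the summed total coincides with $5n + 2m + 3$, which is linear in $n + m$.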

Now, let us see that the family $\Pi$ is VCP-sound and VCP-complete, with
respect to the polynomial encoding $(g, h)$. For that it is sufficient to verify that
the following results are true:

1. If $s$ is a subset with cardinality $k$ covering all edges, then at the end of
the checking stage (that is, in the configuration $C^{5n+2m-1}$) the object $f_{m+1}$
appears in the membrane associated with $s$.

2. If $s$ is not a subset with cardinality $k$ covering all edges, then at the end of
the checking stage (that is, in the configuration $C^{5n+2m-1}$) the object $f_{m+1}$
does not appear in the membrane associated with $s$.

Next we justify that the designed P systems $\Pi(t)$, with $t \in \mathbb{N}$, are recognizing
devices.

By inductive processes it can be proved that the configuration $C^{5n+2m-1}$
verifies the following properties:

(a) It has $2^n$ membranes with label 2. The membranes with label 2 encoding
subsets of vertices with cardinality exactly $k$ have negative charge. The other
membranes with label 2 have neutral charge.

(b) The skin has neutral charge and its content is 

(c) If the object $f_{m+1}$ is present in a membrane with label 2, then the subset of
vertices encoded by it is a vertex cover.

Proposition 51 Suppose that there exist exactly $q$ membranes with label 2 (with
$1 \le q \le \binom{n}{k}$) of $C^{5n+2m-1}$ containing some object $f_{m+1}$. Then, in the last step
of the computation, the P system sends out the object yes to the environment.
Proof. Under the hypothesis of the proposition it is verified that:

(a) Structure of $C^{5n+2m}$.

The rule $[_2 f_{m+1}]_2^- \to [_2\,]_2^- f_{m+1}$ is applicable over the membranes with label
2 of $C^{5n+2m-1}$ containing the object $f_{m+1}$, since their charge is negative.
Hence, in the skin of $C^{5n+2m}$ the objects $f_{m+1}$ and $d_{3n+2m+1}$ must appear,
because of the application of the rule $[_1 d_{3n+2m} \to d_{3n+2m+1}]_1^0$. The other
membranes of $C^{5n+2m-1}$ have no applicable rules.

(b) Structure of $C^{5n+2m+1}$.

The objects $f_{m+1}$ and $d_{3n+2m+1}$ in the skin evolve to $f_{m+2}t$ and $d_{3n+2m+2}$,
respectively, because of the application of the rules $[_1 f_{m+1} \to f_{m+2}t]_1^0$ and
$[_1 d_{3n+2m+1} \to d_{3n+2m+2}]_1^0$. The other membranes of $C^{5n+2m}$ have no
applicable rules.

(c) Structure of $C^{5n+2m+2}$.

The rule $[_1 t]_1^0 \to [_1\,]_1^+ t$ sends out of the system an object $t$ and changes the
polarization of the skin to positive (so that the objects $t$ remaining in
the skin do not continue being expelled from the system). Moreover, the objects
$d_{3n+2m+2}$ in the skin produce $d_{3n+2m+3}$ because of the application of the
rule $[_1 d_{3n+2m+2} \to d_{3n+2m+3}]_1^0$. The other membranes of $C^{5n+2m+1}$ have no
applicable rules.

(d) Structure of $C^{5n+2m+3}$.

Having in mind that the polarization of the skin in the configuration
$C^{5n+2m+2}$ is positive, an object $f_{m+2}$ will be expelled from the system
producing the object yes, because of the application of the rule
$[_1 f_{m+2}]_1^+ \to [_1\,]_1^- \text{yes}$. Moreover, the polarization of the skin changes to negative so
that no object $f_{m+2}$ remaining in the skin is able to continue evolving. The
other membranes of $C^{5n+2m+2}$ have no applicable rules.

Finally, in the configuration $C^{5n+2m+3}$ there is no applicable rule, so it is a
halting configuration. As the system has expelled the object yes to the
environment in the last step, $C$ is an accepting computation.

In a similar way we prove the following result. 

Proposition 52 Let us suppose that no membrane with label 2 of $C^{5n+2m-1}$
contains an object $f_{m+1}$. Then, in the last step of the computation, the P system
sends out the object no to the environment.

From the above propositions we deduce the soundness and completeness of
the family $\Pi$, with respect to the vertex cover problem.

Proposition 53 (Soundness) Let $G$ be a graph. If the computation of
$\Pi(h(G))$ with input $g(G)$ is an accepting computation, then $G$ has a vertex cover
with cardinality $k$.
Proof. Let $G$ be a graph. Let us suppose that the computation of $\Pi(h(G))$ with
input $g(G)$ is an accepting computation. From Propositions 51 and 52 we deduce
that some membranes with label 2 of $C^{5n+2m-1}$ contain the object $f_{m+1}$. Hence
some membrane with label 2 of $C^{5n+2m-1}$ encodes a subset of vertices with
cardinality $k$ covering all edges of the graph. Therefore, $G$ has a vertex cover
with cardinality $k$.



Proposition 54 (Completeness) Let $G$ be a graph. If it has a vertex cover
with cardinality $k$, then the computation of $\Pi(h(G))$ with input $g(G)$ is an
accepting computation.

Proof. Let $G$ be a graph which has a vertex cover with cardinality $k$. Then some
membrane with label 2 of the configuration $C^{5n+2m-1}$ (encoding a relevant subset
of vertices for $G$) contains an object $f_{m+1}$. From Proposition 51 it follows that
$C^{5n+2m+3}$ is an accepting configuration. That is, the computation of $\Pi(h(G))$
with input $g(G)$ is an accepting computation.

The main result follows from all above. 

Theorem 1. The vertex cover problem can be solved in linear time with respect 
to the number of vertices and edges of the graph by recognizing P systems with 
restricted elementary active membranes. 

6 Solving the Vertex Cover Problem 

In this section we will solve the vertex cover problem in linear time ($n + 2k +
2m + 8$) by P systems with restricted elementary active membranes. Unlike the
solution in the previous section, the solution is semi-uniform, but the resources,
such as the number of objects, rules, and symbols in the description of the P
systems, are linear with respect to the number of vertices and the number of edges
in the graph.

Theorem 2. The vertex cover problem can be solved in linear time with respect 
to the number of vertices and edges of the graph by semi-uniform P systems with 
restricted elementary active membranes and linear resources. 

Proof. Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges, $V = \{v_1, v_2,
\dots, v_n\}$, $E = \{e_1, e_2, \dots, e_m\}$. For $k = 1, 2, n - 1$, and $n$, it is easy to check
whether a given graph has a vertex cover of size $k$, so we can assume $3 \le k \le
n - 2$.

We construct the system 



$\Pi = (O, H, \mu, w_1, w_2, R)$,

where

$O = \{a_i, d_i, t_i, t'_i, f_i \mid 1 \le i \le n\} \cup \{h_i, h'_i \mid 0 \le i \le k\} \cup \{e_i \mid 0 \le i \le m\}$
$\quad \cup\ \{g_i \mid 1 \le i \le m + 1\} \cup \{g'_i \mid 1 \le i \le m\} \cup \{a, b, p, q, u, \text{yes}\}$,

$H = \{1, 2\}$,

$\mu = [_1 [_2\,]_2^0 ]_1^0$,

$w_1 = \lambda$,

$w_2 = a_1 a_2 \cdots a_n d_1$,

and the set $R$ contains the following rules:

1. $[_2 a_i]_2^0 \to [_2 t_i]_2^0 [_2 f_i]_2^0$, $1 \le i \le n$.

The objects $a_i$ correspond to vertices $v_i$, $1 \le i \le n$. By using a rule as
above, for $i$ non-deterministically chosen, we produce the two objects $t_i$ and
$f_i$ associated with vertex $v_i$, placed in two separate copies of membrane 2,
where $t_i$ means that vertex $v_i$ appears in some subset of vertices, and $f_i$
means that vertex $v_i$ does not appear in some subset of vertices. Note that
the charge remains the same for both membranes, namely neutral, hence
the process can continue. In this way, in $n$ steps we get all $2^n$ subsets of
$V$, placed in $2^n$ separate copies of membrane 2. In turn, these $2^n$ copies
of membrane 2 are within membrane 1 - the system always has only two
levels of membranes. Note that in spite of the fact that in each step the
object $a_i$ is non-deterministically chosen, after $n$ steps we get the same result,
irrespective of which objects were used in each step.

2. $[_2 d_i \to d_{i+1}]_2^0$, $1 \le i \le n - 1$.

3. $[_2 d_n \to qqh_0]_2^0$.

The objects $d_i$ are counters. Initially, $d_1$ is placed in membrane 2. By division
(when using rules of type 1), we introduce copies of the counter in each new
membrane. In each step, in each membrane with label 2, we pass from $d_i$ to
$d_{i+1}$, thus "counting to $n$". In step $n$ we introduce both $q$ and $h_0$; the objects
$q$ will exit the membrane (at steps $n + 1$, $n + 2$), changing its polarization,
while the object $h_0$ is a new counter which will be used at the subsequent steps as
shown below. Note that at step $n$ we have also completed the generation of
all subsets.

Now we pass to the next phase of computation - counting the number of
objects $t_i$ ($1 \le i \le n$) in each membrane with label 2, which corresponds to
the cardinality of each subset; we will select out the subsets with cardinality
exactly $k$.



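4. $[_2 q]_2^0 \to [_2\,]_2^- u$.

5. $[_2 q]_2^- \to [_2\,]_2^0 u$.

6. $[_2 t_i \to a b t'_i]_2^-$, $1 \le i \le n$.

7. $[_2 h_i \to h'_i]_2^0$, $0 \le i \le k$.

8. $[_2 h'_i \to h_{i+1}]_2^+$, $0 \le i \le k - 1$.

9. $[_2 a]_2^0 \to [_2\,]_2^+ u$.

10. $[_2 b]_2^+ \to [_2\,]_2^0 u$.
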
At step $n + 1$, $h_0$ evolves to $h'_0$, and at the same time one copy of $q$ exits the
membrane, changing its polarization to negative. At step $n + 2$, $t_i$ evolves to
$abt'_i$, and at the same time the other copy of $q$ exits the membrane, changing
its polarization to neutral. At step $n + 3$, one copy of $a$ exits the membrane,
changing its polarization to positive. At step $n + 4$, $h'_0$ evolves to $h_1$, and at
the same time one copy of $b$ exits the membrane, returning its polarization
to neutral (this makes possible the use of rules of types 7 and 9).

The rules of types 7, 8, 9, and 10 are applied as many times as possible (in
one step rules of types 7 and 9, in the next one rules of types 8 and 10, and
then we repeat the cycle). Clearly, at step $n + 2 + 2k$, a membrane contains
object $h_k$ if and only if the cardinality of the corresponding subset is at
least $k$. At step $n + 3 + 2k$, in the membranes whose corresponding subsets
have cardinality more than $k$, $h_k$ evolves to $h'_k$, and one copy of $a$ changes
their polarization to positive. These membranes will no longer evolve, as no
further rule can be applied to them. In the membranes whose corresponding
subsets have cardinality exactly $k$, $h_k$ evolves to $h'_k$, and their polarization
remains neutral, because there is no copy of $a$ which can be used. We will
begin the next phase of computation with the rule of type 11 - checking
whether a subset with cardinality $k$ is a vertex cover.

11. $[_2 h'_k \to qqg_1]_2^0$.

12. $[_2 t'_i \to e_{i_1} \cdots e_{i_l}]_2^-$, $1 \le i \le n$, where vertex $v_i$ is adjacent to edges $e_{i_1}, \dots, e_{i_l}$.

At step $n + 4 + 2k$, in the membranes with label 2 and polarization 0, $h'_k$
evolves to $qqg_1$. At step $n + 5 + 2k$, one copy of $q$ exits the membrane, changing
its polarization to negative. At step $n + 6 + 2k$, in parallel $t'_i$ ($1 \le i \le n$)
evolves to $e_{i_1} \cdots e_{i_l}$, and at the same time the other copy of $q$ exits the membrane,
changing its polarization to neutral. After completing this step, if there is at
least one membrane with label 2 which contains all symbols $e_1, \dots, e_m$, then
this means that the subset corresponding to that membrane is a vertex cover
of cardinality $k$. Otherwise (if in no membrane we get all objects $e_1, \dots, e_m$),
there exists no vertex cover of cardinality $k$. In the following steps, we will
"read" the answer, and send a suitable signal out of the system. This will be
done by the following rules.

13. $[_2 g_i \to g'_i p]_2^0$, $1 \le i \le m$.

14. $[_2 e_1]_2^0 \to [_2\,]_2^+ u$.

Object $g_i$ evolves to $g'_i p$. At the same time, for all subsets of cardinality $k$, we
check whether or not $e_1$ is present in each corresponding membrane. If this is
the case, then one copy of $e_1$ exits the membrane where it is present, evolving
to $u$, and changing in this way the polarization of that membrane to positive
(the other copies of $e_1$ will immediately evolve to $e_0$, which will never evolve
again). The membranes which do not contain the object $e_1$ remain neutrally
charged and they will no longer evolve, as no further rule can be applied to
them.

15. $[_2 e_i \to e_{i-1}]_2^+$, $1 \le i \le m$.

16. $[_2 g'_i \to g_{i+1}]_2^+$, $1 \le i \le m$.

17. $[_2 p]_2^+ \to [_2\,]_2^0 u$.

In the copies of membrane 2 corresponding to subsets of cardinality $k$ with
a positive polarization, hence only those where we have found $e_1$ in the
previous step, we perform two operations in parallel: $g'_i$ evolves to $g_{i+1}$ (so we
count the number of edges), and we decrease the subscripts of all objects $e_j$
from that membrane. Thus, if $e_2$ is present in a membrane, then $e_2$ becomes
$e_1$ - hence the rule of type 14 can be applied again. Note the important fact
that passing from $e_2$ to $e_1$ is possible only in membranes where we already
had $e_1$, hence we check whether edge $e_2$ appears only after knowing that edge
$e_1$ appears. At the same time, the object $p$ exits the membrane, evolving to
$u$, and returning the polarization of the membrane to neutral (this makes
possible the use of rules of types 13 and 14).

The rules of types 13, 14, 15, 16, and 17 are applied as many times as possible
(in one step rules of types 13 and 14, in the next one rules of types 15, 16, and
17, and then we repeat the cycle). Clearly, if a membrane does not contain
an object $e_j$, then that membrane will stop evolving at the time when $e_j$ is
supposed to reach the subscript 1. In this way, after $2m$ steps we can find
whether there is a membrane which contains all objects $e_1, e_2, \dots, e_m$. The
membranes with this property, and only they, will get the object $g_{m+1}$.

18. $[_2 g_{m+1}]_2^0 \to [_2\,]_2^0 \text{yes}$.

19. $[_1 \text{yes}]_1^0 \to [_1\,]_1^+ \text{yes}$.

The object $g_{m+1}$ exits the membrane which corresponds to a vertex cover
with cardinality $k$, producing the object yes. This object is then sent to the
environment, telling us that there exists a vertex cover with cardinality $k$,
and the computation stops. Note that the application of rule 19 changes the
polarization of the skin membrane to positive so that the objects yes
remaining in it are not able to continue exiting it.

From the previous explanation of the use of rules, one can easily see how this
P system works. It is clear that we get the object yes if and only if there exists a vertex
cover of size $k$. The object yes exits the system at moment $n + 2k + 2m + 8$.
If there exists no vertex cover of size $k$, then the computation stops earlier
and we get no output, because no membrane was able to produce the object
$g_{m+1}$, hence the object yes. Therefore, the family of membrane systems we
have constructed is sound, confluent, and linearly efficient. To prove that the
family is semi-uniform in the sense of Section 1, we have to show that for a given
instance, the construction described in the proof can be done in polynomial time
by a Turing machine. We omit the detailed construction because it
is straightforward although cumbersome, as explained in the proof of Theorem
7.2.3 in [4].

So the vertex cover problem was decided in linear time ($n + 2k + 2m + 8$) by
P systems with restricted elementary active membranes, and this concludes the
proof.
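What the semi-uniform system decides can be mirrored by a sequential brute-force check: enumerate the subsets of cardinality exactly $k$, then test whether every edge has an endpoint in the subset. The sketch below is only an illustration of the decision problem, not a simulation of the membrane rules; the function names and the example graph are assumptions, and the sequential search takes exponential time where the P system trades time for an exponential number of membranes.

```python
from itertools import combinations


def has_vertex_cover(n, edges, k):
    """Return True if the graph on vertices 1..n has a vertex cover of size k.

    `edges` is a list of pairs (u, v). Hypothetical helper, for illustration.
    """
    for subset in combinations(range(1, n + 1), k):  # cardinality exactly k
        chosen = set(subset)
        # A subset is a cover when every edge touches at least one chosen vertex.
        if all(u in chosen or v in chosen for u, v in edges):
            return True
    return False


def semi_uniform_steps(n, m, k):
    """Moment at which yes leaves the system, as stated in the text."""
    return n + 2 * k + 2 * m + 8


if __name__ == "__main__":
    path = [(1, 2), (2, 3), (3, 4), (4, 5)]  # example graph, an assumption
    print(has_vertex_cover(5, path, 2), semi_uniform_steps(5, len(path), 2))
```

On the path with five vertices, `{2, 4}` covers all four edges, while no single vertex does.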
7 Solving the Clique Problem

Given a graph $G$ with $n$ vertices and $m$ edges, a clique of $G$ is a subset of the
vertices such that every vertex in the subset is connected to every other vertex
in the subset by an edge; the
clique problem (denoted by CP) asks whether or not there exists a clique for $G$
of size $k$, where $k$ is a given integer less than or equal to $n$. The clique problem
is an NP-complete problem [1].

For a graph $G = (V, E)$, we define the complementary graph $G' = (V, E')$,
where $E' = \{(v_i, v_j) \notin E \mid v_i, v_j \in V\}$. In Figure 2, an example of a graph with
five vertices and five edges is given.
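As a small illustration of the definition, the complementary edge set can be computed directly. This is a sketch under two assumptions that are not in the text: edges are treated as unordered pairs of distinct vertices, and the example graph (a 5-cycle) is hypothetical, since the graph of Figure 2 is not reproduced here.

```python
def complement_edges(n, edges):
    """Edges of the complementary graph on vertices 1..n.

    Edges are treated as unordered pairs of distinct vertices (an assumption).
    """
    present = {frozenset(e) for e in edges}
    return [
        (i, j)
        for i in range(1, n + 1)
        for j in range(i + 1, n + 1)
        if frozenset((i, j)) not in present
    ]


if __name__ == "__main__":
    cycle = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]  # hypothetical example
    print(complement_edges(5, cycle))
```

A graph with five vertices has ten possible edges, so a graph with five edges has a complement with five edges as well.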

In this section we will solve the clique problem in linear time (3n + 2k + 12) 
by recognizing P systems with restricted elementary active membranes. 

Theorem 3. The clique problem can be solved in linear time with respect to the 
number of vertices of the graph by recognizing P systems with restricted elemen- 
tary active membranes. 








Figure 2. The original graph (a) and its complementary graph (b). 

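Proof. Let $G = (V, E)$ be a graph with $n$ vertices and $m$ edges, $V = \{v_1, v_2,
\dots, v_n\}$, $E = \{e_1, e_2, \dots, e_m\}$. The instance of the problem is encoded by a
(multi)set in the alphabet $\Sigma(\langle n, k \rangle) = \{p_{i,j,n+2k+5},\ e_{i,j,n+2k+5} \mid 1 \le i, j \le n,\ i \ne
j\}$. The object $p_{i,j,n+2k+5}$ stands for $(v_i, v_j) \in E$, while $e_{i,j,n+2k+5}$ represents
$(v_i, v_j) \notin E$. For given values of $n$ and $k$ we construct a recognizing P system
$(\Pi(\langle n, k \rangle), \Sigma(\langle n, k \rangle), i(\langle n, k \rangle))$ with $i(\langle n, k \rangle) = 2$,

$\Pi(\langle n, k \rangle) = (O(\langle n, k \rangle), H, \mu, w_1, w_2, R)$,

where

$O(\langle n, k \rangle) = \{a_i, d_i, t_i, f_i, y_i, z_i \mid 1 \le i \le n\}$
$\quad \cup\ \{h_i, h'_i \mid 0 \le i \le k\} \cup \{g_i \mid 0 \le i \le n + 3\}$
$\quad \cup\ \{e_{i,j,l} \mid 1 \le i \le n + 2,\ 1 \le j \le n + 2,\ 0 \le l \le n + 2k + 5\}$
$\quad \cup\ \{c_i \mid 0 \le i \le 3n + 2k + 11\} \cup \{a, b, g, p, q, u, \text{yes}, \text{no}\}$,

$H = \{1, 2\}$,

$\mu = [_1 [_2\,]_2^0 ]_1^0$,

$w_1 = c_{3n+2k+11}$,

$w_2 = a_1 a_2 \cdots a_n d_1$,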
and the set $R$ contains the following rules (we also give explanations about the
use of these rules during the computations):

1. $[_2 a_i]_2^0 \to [_2 t_i]_2^0 [_2 f_i]_2^0$, $1 \le i \le n$.

2. $[_2 d_i \to d_{i+1}]_2^0$, $1 \le i \le n - 1$.

3. $[_2 d_n \to qqh_0]_2^0$.

We omit the explanations about the use of the rules of types 1, 2, and 3,
because they are the same as the explanations in the proof of Theorem 2.
Now we pass to the next phase of computation - counting the number of
objects $t_i$ ($1 \le i \le n$) in each membrane with label 2, which corresponds to
the cardinality of each subset.



4. $[_2 q]_2^0 \to [_2\,]_2^- u$.

5. $[_2 q]_2^- \to [_2\,]_2^0 u$.

6. $[_2 t_i \to ab]_2^-$, $1 \le i \le n$.

7. $[_2 h_i \to h'_i]_2^0$, $0 \le i \le k$.

8. $[_2 h'_i \to h_{i+1}]_2^+$, $0 \le i \le k - 1$.

9. $[_2 a]_2^0 \to [_2\,]_2^+ u$.

10. $[_2 b]_2^+ \to [_2\,]_2^0 u$.

The rules of types 4-10 are used in the same way as the corresponding rules
from the proof of Theorem 2; the only difference is that at step $n + 2$, $t_i$
evolves to $ab$.

The rules of types 7, 8, 9, and 10 are applied as many times as possible (in
one step rules of types 7 and 9, in the next one rules of types 8 and 10, and
then we repeat the cycle). Clearly, at step $n + 2 + 2k$, a membrane contains
object $h_k$ if and only if the cardinality of the corresponding subset is at least
$k$. At step $n + 3 + 2k$, in the membrane whose corresponding subset has
cardinality more than $k$, $h_k$ evolves to $h'_k$, and one copy of $a$ changes its
polarization to positive. This membrane will no longer evolve, as no further
rule can be applied to it. In the membrane whose corresponding subset has
cardinality exactly $k$, $h_k$ evolves to $h'_k$, and its polarization remains neutral,
because there is no copy of $a$ which can be used. We pass to the next phase
of computation - checking whether a subset with cardinality $k$ is a clique.
(A subset $A$ of vertices is a clique if and only if for each edge $(v_i, v_j) \in E'$,
$v_i \in V - A$ or $v_j \in V - A$, i.e., $E' \subseteq V \times (V - A) \cup (V - A) \times V$. The process
of checking whether a subset with cardinality $k$ is a clique is based on this
fact.)

11. $[_2 h'_k \to qqg]_2^0$.

12. $[_2 f_i \to y_i z_i]_2^-$, $1 \le i \le n$.

13. $[_2 g \to g_0]_2^-$,

$[_2 e_{i,j,l} \to e_{i,j,l-1}]_2^\alpha$, $1 \le i, j \le n$, $1 \le l \le n + 2k + 5$, $\alpha \in \{+, -, 0\}$.

At step $n + 4 + 2k$, in the membranes with label 2 and polarization 0, $h'_k$
evolves to $qqg$. At step $n + 5 + 2k$, one copy of $q$ exits the membrane, changing
its polarization to negative. At step $n + 6 + 2k$, in parallel, $f_i$ ($1 \le i \le n$)
evolves to $y_i z_i$, $g$ evolves to $g_0$ and an object $e_{i,j,0}$ appears for each $(v_i, v_j) \in
E'$. At the same time the other copy of $q$ exits the membrane, changing its
polarization to neutral.

14. $[_2 y_i \to y_{i+1}]_2^0$, $1 \le i \le n - 1$.

15. $[_2 z_i \to z_{i+1}]_2^0$, $1 \le i \le n - 1$.

16. $[_2 e_{i,j,0} \to e_{s(i),s(j),0}]_2^0$, $1 \le i, j \le n + 2$, where $s(t) = \min(t + 1, n + 2)$.

17. $[_2 g_i \to g_{i+1}]_2^0$, $0 \le i \le n - 1$.

18. $[_2 z_n \to p]_2^0$.

19. $[_2 y_n]_2^0 \to [_2\,]_2^+ u$.

At step $n + 2k + 7$, $y_i$, $z_i$ ($1 \le i \le n - 1$) evolve to $y_{i+1}$, $z_{i+1}$, $g_0$ evolves to
$g_1$, $z_n$ evolves to $p$, $e_{i,j,0}$ ($1 \le i, j \le n$) evolve to $e_{s(i),s(j),0}$, where $s(t) =
\min(t + 1, n + 2)$; at the same time, $y_n$ exits the membrane where it appears,
changing the polarization of that membrane to positive.

20. $[_2 e_{i,n+1,0} \to u]_2^+$, $1 \le i \le n + 1$.

21. $[_2 e_{n+1,i,0} \to u]_2^+$, $1 \le i \le n + 1$.

22. $[_2 p]_2^+ \to [_2\,]_2^0 u$.

In the membranes with label 2 and positive polarization (i.e., the membranes
where $y_n$ appeared in the previous step, which means that vertex $v_n$ does not belong
to the corresponding subset), $e_{i,n+1,0}$ and $e_{n+1,i,0}$ ($1 \le i \le n + 1$) evolve to
$u$ (which will never evolve again); at the same time, object $p$ exits the
membrane, returning the polarization of the membrane to neutral (this makes
possible the use of rules of types 14, 15, 16, 17, 18, and 19).

In the membranes with label 2 and neutral polarization (i.e., the membranes
where $y_n$ did not appear in the previous step, which means that vertex $v_n$ belongs
to the corresponding subset), using the rule of type 16, $e_{i,n+1,0}$ and $e_{n+1,i,0}$
($1 \le i \le n + 1$) evolve to $e_{i+1,n+2,0}$ and $e_{n+2,i+1,0}$ (this means that the edges
$e_{i,n}$ and $e_{n,i}$ do not belong to $V \times (V - A) \cup (V - A) \times V$, where $A$ is the
corresponding subset).

The rules of types 14, 15, 16, 17, 18, 19, 20, 21, and 22 are applied as many
times as possible (in one step rules of types 14, 15, 16, 17, 18, and 19, in
the next one rules of types 20, 21, and 22, and then we repeat the cycle).
In this way, after $2n$ steps, a membrane will contain an object $e_{n+2,n+2,0}$
if and only if that membrane contains an edge which does not belong to
$V \times (V - A) \cup (V - A) \times V$, where $A$ is the corresponding subset. In the
following steps, we will let the membranes corresponding to a positive answer
send out an object.

23. $[_2 g_n \to g_{n+1} q]_2^0$.

24. $[_2 g_{n+1} \to g_{n+2}]_2^0$.

25. $[_2 g_{n+2} \to g_{n+3}]_2^-$.

26. $[_2 e_{n+2,n+2,0}]_2^- \to [_2\,]_2^+ u$.

27. $[_2 g_{n+3}]_2^- \to [_2\,]_2^0 \text{yes}$.

28. $[_1 \text{yes}]_1^0 \to [_1\,]_1^+ \text{yes}$.

At step $3n + 2k + 7$, $g_n$ evolves to $g_{n+1}q$. At step $3n + 2k + 8$, $g_{n+1}$ evolves
to $g_{n+2}$, and at the same time, object $q$ exits the membrane, changing the
polarization to negative (using the rule of type 4). At step $3n + 2k + 9$, in the
membranes which contain object $e_{n+2,n+2,0}$, $g_{n+2}$ evolves to $g_{n+3}$, and at the
same time, $e_{n+2,n+2,0}$ exits these membranes, changing the polarization of
these membranes to positive, so they will never evolve again; in the membranes
which do not contain object $e_{n+2,n+2,0}$, $g_{n+2}$ evolves to $g_{n+3}$, the polarization
of these membranes remains negative, and in the next step they will produce the
object yes. This object is then sent to the environment, telling us that there
exists a clique with cardinality $k$, and the computation stops. The application
of rule 28 changes the polarization of the skin membrane to positive so
that the objects yes remaining in it are not able to continue exiting it.

29. $[_1 c_i \to c_{i-1}]_1^0$, $1 \le i \le 3n + 2k + 11$.

30. $[_1 c_0]_1^0 \to [_1\,]_1^0 \text{no}$.

At step $3n + 2k + 11$, the skin membrane has object $c_0$, originating in
$c_{3n+2k+11}$. Note that if there exists no clique of size $k$, then at this
moment the polarization of the skin membrane is neutral. In the next step, $c_0$
produces the object no, and the object no is sent to the environment, telling
us that there exists no clique with cardinality $k$, and the computation stops.

From the previous explanation of the use of rules, one can easily see how
this P system works. It is clear that we get the object yes if and only if there exists
a clique of size $k$. The object yes exits the system at moment $3n + 2k + 11$. If
there exists no clique of size $k$, then at step $3n + 2k + 12$ the system sends
the object no to the environment. Therefore, the family of membrane systems
we have constructed is sound, confluent, and linearly efficient. To prove that the
family is uniform in the sense of Section 1, we have to show that for a given size,
the construction of P systems described in the proof can be done in polynomial
time by a Turing machine. Again, we omit the detailed construction.

So the clique problem was decided in linear time ($3n + 2k + 12$) by recognizing
P systems with restricted elementary active membranes, and this concludes the
proof.
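The criterion the checking phase relies on (a subset $A$ is a clique if and only if every edge of the complementary graph has an endpoint outside $A$) can be illustrated sequentially. This is a minimal sketch of the criterion, not a simulation of the membrane system; the function names and the example graph are assumptions, and edges are treated as unordered pairs of distinct vertices.

```python
from itertools import combinations


def is_clique_via_complement(n, edges, subset):
    """Check the clique criterion through the complementary graph.

    `subset` is a clique iff no complementary edge has both endpoints in it.
    Hypothetical helper; vertices are numbered 1..n.
    """
    a = set(subset)
    present = {frozenset(e) for e in edges}
    for i, j in combinations(range(1, n + 1), 2):
        if frozenset((i, j)) not in present:  # (v_i, v_j) is an edge of E'
            if i in a and j in a:             # neither endpoint lies in V - A
                return False
    return True


def has_clique(n, edges, k):
    """Return True if some subset of cardinality exactly k is a clique."""
    return any(
        is_clique_via_complement(n, edges, s)
        for s in combinations(range(1, n + 1), k)
    )


if __name__ == "__main__":
    g = [(1, 2), (1, 3), (2, 3), (3, 4)]  # example graph, an assumption
    print(has_clique(4, g, 3), has_clique(4, g, 4))
```

In the example, `{1, 2, 3}` is a clique, while `{2, 3, 4}` is not because the complementary edge between vertices 2 and 4 lies entirely inside the subset.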



8 Conclusions 

We have shown that the vertex cover problem and the clique problem can be 
solved in linear time with respect to the number of vertices and edges of the 
graph by recognizing P systems with restricted elementary active membranes. It 
is also interesting to solve other related graph problems, in a “uniform” manner, 
by P systems which are the same, excepting a module specific to each problem. 

The solution presented in Section 5 differs from the solution in Section 6 in
the following sense: a family of recognizing P systems with active membranes is
constructed, associated with the problem that is being solved, in such a way that
all the instances of the problem that have the same length (according to a given
polynomial-time computable criterion) are processed by the same P system (to
which an appropriate input, that depends on the concrete instance, is supplied).
By contrast, in the solution presented in Section 6, a single P system is
associated with each one of the instances of the problem.
Let us denote by $\mathcal{AM}$ the class of recognizing P systems with active
membranes and with 2-division. Then from Theorem 1 and Theorem 3 we have
$VCP \in LMC_{\mathcal{AM}} \subseteq PMC_{\mathcal{AM}}$ and $CP \in LMC_{\mathcal{AM}} \subseteq PMC_{\mathcal{AM}}$.
Because the class $PMC_{\mathcal{AM}}$ is stable under polynomial time reduction, we have
$NP \subseteq PMC_{\mathcal{AM}}$. Similarly, from Theorem 2 we have $VCP \in PMC^{S}_{\mathcal{AM}}$
and $NP \subseteq PMC^{S}_{\mathcal{AM}}$, which can also be obtained from $VCP \in PMC_{\mathcal{AM}}$,
$NP \subseteq PMC_{\mathcal{AM}}$ and the fact $PMC_{\mathcal{AM}} \subseteq PMC^{S}_{\mathcal{AM}}$.

The following problems are open:

(1) Does $NP = PMC_{\mathcal{AM}}$ or $NP = PMC^{S}_{\mathcal{AM}}$ hold?

(2) Find a class $\mathcal{F}$ of P systems such that $NP = PMC_{\mathcal{F}}$ or $NP = PMC^{S}_{\mathcal{F}}$.
Abstract. The time is approaching when information can be written
into DNA. This tutorial work surveys the methods for designing code 
words using DNA, and proposes a simple code that avoids unwanted 
hybridization in the presence of shift and concatenation of DNA words 
and their complements. 



1 Introduction 

As bio- and nano-technology advances, the demand for writing information into 
DNA increases. Areas of immediate application are: 

- DNA computation, which attempts to realize biological mathematics, i.e.,
solving mathematical problems by applying experimental methods in
molecular biology [1]. Because a problem must first be encoded in DNA terms, the
method of encoding is of crucial importance. Typically, a set of fixed-length
oligonucleotides is used to denote logical variables or graph components.

- DNA tag/antitag system, which designs fixed-length short oligonucleotide
tags for identifying biomolecules (e.g., cDNA), used primarily for monitoring
gene expressions [2,3,4].

- DNA data storage, which advocates the use of bacterial DNA as a long-lasting
high-density data storage, which can also be resistant to radiation [5].

- DNA signature, which is important for registering a copyright of engineered
bacterial and viral genomes. Steganography (an invisible signature hidden in
other information) is useful for the exchange of engineered genomes among
developers.

These fields are unlike conventional biotechnologies in that they attempt to 
encode artificial information into DNA. They can be referred to as ‘encoding 
models’. Although various design strategies for DNA sequences have been pro- 
posed and some have been demonstrated, no standard code like the ASCII code 
exists for these models, presumably because data transfer in the form of DNA 
has not been a topic of research. In addition, requirements for DNA sequences 
differ for each encoding model. 

In this tutorial work, the design of DNA words as information carriers is 
surveyed and a simple, general code for writing information into biopolymers is 
proposed. After this introduction, Section 2 introduces major constraints 
considered in the word design. In Section 3, three major design approaches are briefly 
overviewed and our approach is described in Section 4. Finally, Section 5 shows 
an exemplary construction of DNA code words using our method. 



2 Requirements for a DNA Code 



DNA sequences consist of four nucleotide bases (A: adenine, C: cytosine, G: 
guanine, and T: thymine), and are arrayed between chemically distinct terminals 
known as the 5'- and 3'-ends. A double-helix DNA is formed by a sequence and 
its complement. The complementary strand, or complement, is obtained by 
substituting base A with base T and base C with base G (and vice versa), and 
reversing the direction. For example, the sequence 5'-AAGCGCTT-3' is the 
complement of itself: 

    5' - AAGCGCTT - 3' 
    3' - TTCGCGAA - 5' 

A non-complementary base in a double strand cannot form stable hydrogen bonds 
and is called a (base) mismatch. The stability of a DNA double helix depends on 
the number and distribution of base mismatches [6]. 
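As a quick illustration of the definition above, the following minimal Python 
sketch computes the complement of a strand written 5' to 3' and confirms that 
AAGCGCTT is its own complement. The function name `reverse_complement` is 
chosen here for illustration and does not come from the paper. 

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the complementary strand of `seq`, written 5' to 3'.

    Each base is substituted by its partner and the direction is reversed.
    """
    return "".join(PAIR[base] for base in reversed(seq))


if __name__ == "__main__":
    print(reverse_complement("AAGCGCTT"))  # AAGCGCTT (self-complementary)
    print(reverse_complement("AAGC"))      # GCTT
```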

Now consider a set of DNA words for information interchange. Each word 
must be as distinct as possible so that no words will induce unwanted 
hybridization (mishybridization) regardless of their arrangement. At the same time, all 
words must be physico-chemically uniform (concerted) to guarantee an unbiased 
reaction in biological experiments. 

In principle, there are two measures for evaluating the quality of designed 
DNA words: statistical thermodynamics and combinatorics. Although the 
thermodynamic method may yield a more accurate estimation, its computational 
cost is high. Since combinatorial estimations approximate the thermodynamic 
ones, the focus in this work is on the combinatorial method, described in 
terms of discrete constraints that DNA words should satisfy. In what follows, 
formal requirements for the DNA word set will be introduced. 



2.1 Constraints on Sequences 

DNA words are assumed to be of equal length. This assumption holds true in 
most encoding models. (Some models use oligonucleotides of different lengths 
for spacer- or marker sequences. As such modifications do not change the nature 
of the problem, their details are not discussed here.) The design problem posed 
by DNA words has much in common with the construction of classical error- 
correcting code words. 

Let $x = x_1 x_2 \cdots x_n$ be a DNA word over the four bases {A, C, G, T}. 
The reverse of $x$ is denoted $x^R = x_n x_{n-1} \cdots x_1$, and the 
complement of $x$, obtained by replacing base A with T and base C with G in 
$x$ (and vice versa), is denoted $x^C$. The Hamming distance $H(x, y)$ between 
two words $x = x_1 x_2 \cdots x_n$ and $y = y_1 y_2 \cdots y_n$ is the number 
of indices $i$ such that $x_i \neq y_i$. For a set of DNA words $S$, $S^{RC}$ 
is its complementation with reverse complement sequences, i.e., 
$S^{RC} = \{x \mid x \in S \text{ or } (x^R)^C \in S\}$. 
Hamming Constraints As in coding theory, designed DNA words should keep 
a large Hamming distance between all word pairs. What makes DNA code design 
more complicated than the standard theory of error-correcting codes is 
that we must consider not only $H(x, y)$ but also $H(x^C, y^R)$ to guarantee 
mismatches in the hybridization with other words and their complements (Fig. 1). 
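The two distances can be sketched in a few lines of Python. This is only an 
illustration of the definitions: the helper names (`hamming`, 
`duplex_mismatches`, `rc_closure`) are assumptions made here, and 
`duplex_mismatches` implements the reading in which the second distance 
compares the complement of one word with the reverse of the other. 

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(x, y))


def reverse(x: str) -> str:
    return x[::-1]


def complement(x: str) -> str:
    return x.translate(COMPLEMENT)


def duplex_mismatches(x: str, y: str) -> int:
    """Mismatches when x and y pair as antiparallel strands.

    Compares the complement of x with the reverse of y; the result is 0
    exactly when y is the reverse complement of x.
    """
    return hamming(complement(x), reverse(y))


def rc_closure(words) -> set:
    """The word set together with all reverse complements of its members."""
    return set(words) | {complement(reverse(w)) for w in words}


if __name__ == "__main__":
    print(duplex_mismatches("AAGC", "GCTT"))  # 0: a perfect duplex
    print(duplex_mismatches("AAGC", "AAGC"))  # 4: no stable pairing
    print(sorted(rc_closure({"AAGC"})))       # ['AAGC', 'GCTT']
```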



[Figure 1 not reproduced. Example pattern: 110100. Legend: 1 = {A,T}, 0 = {G,C}.] 

Fig. 1. Binary Pattern of Hybridization. The complementary strand has a 
reverse pattern of {A,T} and {G,C} bases. A reverse complement of a DNA word 
corresponds to its complementary strand. 



Comma-Free Constraints It is desirable for the designed words to be comma-free 
because DNA has no fixed reading frames. By definition, a code $S$ is 
comma-free if the overlap of any two, not necessarily different, code words 
$x_1 x_2 \cdots x_n \in S$ and $y_1 y_2 \cdots y_n \in S$ (i.e., 
$x_{r+1} x_{r+2} \cdots x_n y_1 y_2 \cdots y_r$, $0 < r < n$) does not result 
in another code word in $S$ [7,8]. The property by which any overlapping word 
differs from every code word in at least $d$ positions is called comma-free 
with index $d$. Thus, our DNA code should be comma-free with a high index.¹ 
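The definition translates directly into a brute-force check. The sketch below 
is an illustration only (the name `comma_free_index` is an assumption made 
here); it returns the largest d for which the given code is comma-free with 
index d, so a result of 0 means the code is not comma-free. It inspects only 
the words it is given; to cover complements as well, the reverse complements 
would have to be added to the input. 

```python
from itertools import product


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def comma_free_index(code) -> int:
    """Smallest distance between any overlap of two code words and any code word.

    An overlap is the tail of x from position r joined with the first r
    letters of y, for every shift 0 < r < n.
    """
    words = list(code)
    n = len(words[0])
    best = n
    for x, y in product(words, repeat=2):
        for r in range(1, n):
            overlap = x[r:] + y[:r]
            for w in words:
                best = min(best, hamming(overlap, w))
    return best


if __name__ == "__main__":
    print(comma_free_index(["ACC"]))  # 2
    print(comma_free_index(["AAA"]))  # 0: AAA overlaps with itself
```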

Note that comma-freeness is not replaced by introducing predefined ‘spacer’ 
words between code words. Such spacers may facilitate the decoding of words, 
but they do not contribute to the avoidance of mishybridization. Moreover, spac- 
ers lengthen the encoded DNA and lower its information content. 



¹ The idea of comma-freeness originated in the elucidation of the DNA 
translation mechanism. Early on, DNA codons for 20 amino acids were thought to 
be encoded in a comma-free manner [9]. Incidentally, the number of comma-free 
code words of length 3 over 4 bases is at most 20. The systematic design of a 
comma-free code of index 1 was soon proposed [10,11]. 

Energy Constraints In addition to the above constraints on mismatches, the 
melting temperatures of DNA words must be very similar to guarantee their 
concerted behavior in vitro. The most reliable estimation is the nearest neighbor 
approximation, where the temperature is computed from the frequency of 16 
base dimers (from AA to TT) [12,6]. Arita and Kobayashi proposed a further 
approximation by grouping [GC] and [AT], where the temperature depends on 
the frequency of only 3 patterns ([GC][GC], [GC][AT] or [AT][GC], and [AT][AT]) [13]. 
The dimer frequency of a sequence x is the triple of integers giving the 
frequencies of these 3 patterns, in this order. To account for the terminal 
bases, x is treated as cyclic when the frequencies are counted. For example, 
AAGCGCTT and TACGGCAT exhibit close melting temperatures because they share 
the same dimer frequency (3,2,3). Thus, all DNA code words should share the 
same dimer frequency to guarantee their concerted behavior. 
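The dimer frequency is straightforward to compute. The following sketch 
(function name `dimer_frequency` assumed here for illustration) counts the 
three grouped patterns over the cyclic sequence and reproduces the (3,2,3) 
value quoted for the two example words. 

```python
def dimer_frequency(seq: str) -> tuple:
    """Return (#[GC][GC], #mixed, #[AT][AT]) over the cyclic sequence.

    'Mixed' covers both [GC][AT] and [AT][GC]. The last base is paired
    with the first one, so a word of length n yields n dimers.
    """
    strong = [base in "GC" for base in seq]
    n = len(seq)
    gc_gc = mixed = at_at = 0
    for i in range(n):
        first, second = strong[i], strong[(i + 1) % n]
        if first and second:
            gc_gc += 1
        elif first or second:
            mixed += 1
        else:
            at_at += 1
    return (gc_gc, mixed, at_at)


if __name__ == "__main__":
    print(dimer_frequency("AAGCGCTT"))  # (3, 2, 3)
    print(dimer_frequency("TACGGCAT"))  # (3, 2, 3)
```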



Other Constraints Depending on the model used, there are constraints in 
terms of base mismatches. We focus on the first 2 constraints in this paper. 

1. Forbidden subwords that correspond to restriction sites, simple repeats, or 
other biological signal sequences, should not appear anywhere in the designed 
words and their concatenations. This constraint arises when the encoding 
model uses pre-determined sequences such as genomic DNA or restriction 
sites for enzymes. 

2. Any subword of length k should not appear more than once in the designed 
words. This constraint is imposed to ensure the avoidance of base pair 
nucleation that leads to mishybridization. The number k is usually ≥ 6. 

3. A secondary structure that impedes expected hybridization of DNA words 
should not arise. To find an optimal structure for these words, the mini- 
mum free energy of the strand is computed by dynamic programming [14]. 
However, the requirement here is that the words do not form some struc- 
ture. This constraint arises when temperature control is important in the 
encoding models. 

4. Only three bases, A, C, and T, may be used in the word design. This constraint 
serves primarily to reduce the number of mismatches by biasing the base 
composition, and to eliminate G-stacking energy [15]. In RNA word design, 
this constraint is important because in RNA, both G-C pairs and G-U pairs 
(equivalent to G-T in DNA) form stably. 
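The first two constraints in this list lend themselves to simple mechanical 
checks. The sketch below is a hedged illustration, not the procedure used in 
the paper: the function names are invented here, complements are not 
considered, and the EcoRI recognition site GAATTC is used only as an example 
of a forbidden subword. 

```python
from collections import Counter
from itertools import product


def repeated_subwords(words, k: int) -> set:
    """Length-k subwords that occur more than once across all words."""
    counts = Counter(
        w[i:i + k] for w in words for i in range(len(w) - k + 1)
    )
    return {sub for sub, n in counts.items() if n > 1}


def has_forbidden(words, forbidden) -> bool:
    """True if a forbidden subword occurs in a word or a two-word concatenation."""
    words = list(words)
    texts = words + [x + y for x, y in product(words, repeat=2)]
    return any(site in text for text in texts for site in forbidden)


if __name__ == "__main__":
    print(repeated_subwords(["ACGTAC", "TTACGT"], 4))        # {'ACGT'}
    # GAATTC appears only across the junction of the two words.
    print(has_forbidden(["ACGGAA", "TTCACG"], ["GAATTC"]))   # True
```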



2.2 Data Storage Style 

Because there is no standard DNA code, it may seem premature to discuss meth- 
ods of aligning words or their storage, i.e., their data-addressing style. However, 
it is worth noting that the storage style depends on the word design; the im- 
mobilization technique, like DNA chips, has been popular partly because its 
weaker constraint on words alleviates design problems encountered in scaling up 
experiments. 



Surface-Based Approach In the surface-based (or solid-phase) approach, 
DNA words are placed on a solid support (Fig 2). This method has two ad- 
vantages: (1) since one strand of the double helix is immobilized, code words can 
be separated (washed out) from their complements, thereby reducing the risk of 
unexpected aggregation of words [16]; (2) since fluorescent labeling is effective, 
it is easier to recognize words, e.g., for information readout. 
Fig. 2. The Surface-Based versus the Soluble Approach. While they are 
indistinguishable in solution, immobilization makes it easy to separate information 
words (gray) from their complements (black). 

Soluble Approach Easier access to information on surfaces simultaneously 
limits the innate abilities of biomolecules. DNA fragments in solution acquire 
more flexibility as information carriers, and have been shown capable of sim- 
ulating cellular automata [17]. Other advantages of the soluble approach are: 
(1) it opens the possibility of autonomous information processing [18]; (2) it is 
possible to introduce DNA words into microbes. The words can also be used for 
nanostructure design. 

Any systematic word design that avoids mishybridization should serve both 
approaches. Therefore, word constraints must extend to complements of code 
words. Our design problem is then summarized as follows. 

Problem: Given two integers $l$ and $d$ ($l > d > 0$), design a set $S$ 
of length-$l$ DNA words such that $S^{RC}$ is comma-free with index $d$ and, 
for any two sequences $x, y \in S^{RC}$, $H(x, y) \geq d$ and 
$H(x^C, y^R) \geq d$. Moreover, all words in $S^{RC}$ share the same dimer 
frequency. 

3 Previous Works 

Due to the different constraints, there is currently no standard method for de- 
signing DNA code words. In this section, three basic approaches are introduced: 
(1) the template-map strategy, (2) De Bruijn construction, and (3) the stochastic 
method. 




3.1 Template-Map Strategy 

This simple yet powerful construction was apparently first proposed by Condon’s 
group [16]. Constraints on the DNA code are divided and separately assigned 
to two binary codes, e.g., one specifies the GC content (called templates), the 
other specifies mismatches between any word pairs (called maps). The product 
of two codes produces a quaternary code with the properties of both codes (Fig 
3). Frutos et al. designed 108 words of length 8 where (1) each word has four 
GCs; (2) each pair of words, including reverse complements, differs in at least 
four bases [16]. Later, Li et al., who used the Hadamard code, generalized this 
construction to longer code words that have mismatches equal to or exceeding 
half their length [19]. They presented, as an example, the construction of 528 
words of length 12 with 6 minimum mismatches. 



    Templates:          Maps: 
    AACCACCA            00000000 
    ACCAAACC            10100101 
    CCAACAAC      ×     10101010 
    CAACCCAA            01010101 
    ACACACAC            01011010 
    AAAACCCC 

      AACCACCA   template 
    × 10100101   map 
      TAGCAGCT   8-mer 

Fig. 3. Template-Map Strategy. In this figure, templates specify that the 
sequences contain 50% GCs and four mismatches between them and their 
complements. Maps are error-correcting code words and specify the choice 
between A and T, or C and G. 
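The product shown in Fig. 3 can be written as a one-line substitution. The 
sketch below assumes, as in the figure's worked example, that templates are 
written over {A, C} and that a map bit of 1 switches A to T or C to G; the 
name `apply_map` is introduced here for illustration. 

```python
# A map bit of 1 replaces the template base by its partner; 0 keeps it.
SWAP = {"A": "T", "C": "G"}


def apply_map(template: str, bits: str) -> str:
    """Combine an {A, C} template with a binary map word into a DNA word."""
    if len(template) != len(bits):
        raise ValueError("template and map must have equal length")
    return "".join(
        SWAP[base] if bit == "1" else base
        for base, bit in zip(template, bits)
    )


if __name__ == "__main__":
    print(apply_map("AACCACCA", "10100101"))  # TAGCAGCT, as in Fig. 3
```

Because the map never moves a base between the {A,T} and {G,C} groups, every 
word derived from one template keeps that template's GC positions. 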



The drawback of this construction is twofold. First, the melting temperatures 
of the designed quaternary words may differ regardless of their uniform 
GC content. This property was analyzed by Li et al., and the predicted melting 
temperatures of the 528 words differed over a 20 °C range [19]. The second 
problem is comma-freeness. Although the design has been effectively demonstrated 
in the surface-based approach, scaling up to multiple words will be difficult due 
to mishybridization. 

3.2 De Bruijn Construction 

The longer a consecutive run of matched base pairs, the higher is the risk of 
mishybridization. The length-$k$ subword constraint to avoid mishybridization is 
satisfied with a binary De Bruijn sequence of order $k$, a circular sequence of 
length $2^k$ in which each subword of length $k$ occurs exactly once.² A linear 
time algorithm for the construction of a De Bruijn sequence is known [20]. 
Ben-Dor et al. showed an optimal choosing algorithm of oligonucleotide tags that 
satisfy the length-$k$ subword constraint and also share similar melting 
temperatures [4]. 

One disadvantage is that the number of mismatches between words may be 
small, because the length-$k$ constraint guarantees only one mismatch for each 
$k$-mer. Another disadvantage is again the comma-freeness. 
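For readers who want to experiment, a binary De Bruijn sequence can be 
generated with the standard recursive construction that concatenates Lyndon 
words. This is a minimal sketch and not necessarily the algorithm of [20]; 
the function names are chosen here for illustration. 

```python
def de_bruijn_binary(k: int) -> str:
    """Return a binary De Bruijn sequence of order k (length 2**k, cyclic)."""
    a = [0] * (k + 1)
    out = []

    def extend(t: int, p: int) -> None:
        if t > k:
            # Emit the current prefix when its period p divides k.
            if k % p == 0:
                out.extend(a[1:p + 1])
            return
        a[t] = a[t - p]
        extend(t + 1, p)
        for j in range(a[t - p] + 1, 2):
            a[t] = j
            extend(t + 1, t)

    extend(1, 1)
    return "".join(map(str, out))


def cyclic_subwords(seq: str, k: int) -> list:
    """All length-k windows of seq, read cyclically."""
    doubled = seq + seq[:k - 1]
    return [doubled[i:i + k] for i in range(len(seq))]


if __name__ == "__main__":
    s = de_bruijn_binary(3)
    print(s)                      # 00010111
    print(cyclic_subwords(s, 3))  # eight distinct 3-bit windows
```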

3.3 Stochastic Method 

The stochastic method is the most widely used approach in word design; there 
are as many types of design software as there are reported experiments. 

² A De Bruijn sequence can be quaternary. By using the binary version, however, we can 
easily satisfy the constraint that the subword does not occur in the complementary 
strand. 



Writing Information into DNA 






Deaton et al. used genetic algorithms to find code words of similar melting 
temperatures that satisfy the ‘extended’ Hamming constraint, i.e., a constraint 
where mismatches in the case of shift are also considered [21]. (The constraint, 
which they named the H-measure, is different from comma-freeness in that it 
considers mismatches between two words, not their overlaps.) Due to the 
complexity of the problem, they reported that genetic algorithms can be applied 
to code words of up to length 25 [22]. 

Landweber et al. used a random word-generation program to design two sets 
of 10 words of length 15 that satisfy the conditions (1) no more than five matches 
over a 20-nucleotide window in any concatenation between all $2^{10}$ combinations; 
(2) similar melting temperatures of 45 °C; (3) avoidance of secondary structures; 
and (4) no consecutive matches of more than 7 base pairs.³ All of these strong 
constraints could be satisfied with only 3 bases [15]. Other groups that employed 
three-base words likewise used random word-generation for their word design 
[23,24]. 

Although no detailed analyses for such algorithms are available, the power 
of stochastic search is evident in the work of Tulpan et al., who could increase 
the number of code words designed by the template-map strategy [25]. However, 
they reported that the stochastic search failed to outperform the template-map 
strategy if searches were started from scratch. Therefore it is preferable to apply 
the stochastic method to enlarge already designed word sets. 

4 Methods 

4.1 Comma-Free Error-Correcting DNA Code 

Among the different constraints on DNA code words, the most difficult to satisfy 
is comma-freeness; no systematic construction is known for a comma-free code of 
high index. The stochastic search is also not applicable because its computational 
cost is too high. 

The comma-free property is, however, a necessary condition for the design of 
a general-purpose DNA code. This section presents the construction method for a 
comma-free error-correcting DNA code, and proposes a DNA code: 112 words of 
length 12 that mismatch in at least 4 positions in any mishybridization, share no 
common subword longer than 6 bases, and retain similar melting temperatures. 



³ The fourth condition is unnecessary if the first one is satisfied; presented here are 
all conditions considered in the original paper. 

Basic Design For this design, we employed the method of Arita and Kobayashi 
[13]. It can systematically generate a set of words of length $l$ such that any of 
its members will have approximately $l/3$ mismatches with other words, their 
complements, and overlaps of their concatenations. They constructed sequences 
as a product of two types of binary words as in the template-map strategy, except 
that they used a single binary word, denoted $T$, as the template. Template $T$ is 
chosen so that its alignment with each of the following patterns always contains 
$d$ or more mismatches: 

$$T^R, \quad TT, \quad TT^R, \quad T^R T, \quad T^R T^R \qquad (1)$$ 

The template specifies the GC positions of the designed words: [GC] corresponds 
to either the 1's or the 0's in the template. Since the pattern $T^R$ specifies the 
AT/GC pattern of reverse complements, the mismatches between $T$ and $T^R$ 
guarantee the base mismatches between forward strands and reverse strands of 
designed DNAs. The other patterns, from $TT$ to $T^R T^R$, are responsible for 
shifted patterns. 

For the map words, any binary error-correcting code of minimum distance 
d or greater is used. Then, any pair of words in the resulting quaternary code 
induces at least d mismatches without block shift because of the error-correcting 
code, and with block shift or reversal because of the chosen template. 
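One way to read the template condition is sketched below. It rests on an 
assumption about what "alignment" means: the template is compared with its 
reversal without shift, and with every properly shifted window (shift 
strictly between 0 and the word length) of the four concatenations built 
from the template and its reversal. The helper names are invented for this 
illustration, and the two short templates are toy inputs, not templates from 
the paper. 

```python
def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def shifted_patterns(template: str) -> list:
    """Patterns the template is compared against under the assumed reading.

    The reversed template (no shift), plus every window of the template's
    length taken at shifts 1..n-1 from TT, TT^R, T^R T and T^R T^R.
    """
    t, r = template, template[::-1]
    n = len(t)
    patterns = [r]
    for pair in (t + t, t + r, r + t, r + r):
        patterns.extend(pair[s:s + n] for s in range(1, n))
    return patterns


def template_min_mismatch(template: str) -> int:
    """Smallest number of mismatches between the template and any pattern."""
    return min(hamming(template, p) for p in shifted_patterns(template))


if __name__ == "__main__":
    print(template_min_mismatch("1100"))    # 0: unusable as a template
    print(template_min_mismatch("110100"))  # 2
```

Under this reading, a template is acceptable for a target d when 
`template_min_mismatch` returns at least d. 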

Comma-freeness is not the only advantage of their method. Because a single 
template is used to specify GC positions for all words, the GC arrangement of 
resulting code words is uniform, resulting in similar melting temperatures for all 
words in the nearest neighbor approximation [13]. 



Other Constraints In this subsection, methods to satisfy other practical con- 
straints are introduced. 

Forbidden subword 

Since the error-correcting property of the map words is invariant under exchanging 
columns and 0-1 flipping of columns in all words, this constraint can be easily satisfied. 

Length-k subword 

For the DNA words to satisfy this constraint, two additional conditions are 
necessary: First, the template should not share any length-$k$ subword with patterns 
in (1). Second, the map words should not share any length-$k$ subword among 
them. 

The first condition can be imposed when the template is selected. To satisfy 
the second condition, the obvious transformation from word selection to the Max 
Clique Problem is used: the nodes correspond to the words, and the edges are 
linked only when two words do not share any length-$k$ subword (without block 
shift). Note that the clique size is upper bounded by $2^k$. 
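The transformation can be sketched as follows. Two assumptions are made 
here: "without block shift" is read as "at the same offset in both words", 
and a simple greedy pass stands in for a real Max Clique solver, so the 
result is only a lower bound on the largest selectable set. The function 
names are illustrative. 

```python
def share_subword(x: str, y: str, k: int) -> bool:
    """True if x and y have the same length-k subword at the same offset."""
    return any(x[i:i + k] == y[i:i + k] for i in range(len(x) - k + 1))


def greedy_clique(words, k: int) -> list:
    """Greedily pick words that are pairwise free of shared length-k subwords.

    Two words are joined by an edge when they share no subword, so the
    returned list is a clique in that graph (not necessarily a maximum one).
    """
    chosen = []
    for w in words:
        if all(not share_subword(w, c, k) for c in chosen):
            chosen.append(w)
    return chosen


if __name__ == "__main__":
    maps = ["0000", "0001", "1111", "1110"]
    print(greedy_clique(maps, 3))  # ['0000', '1111']
```

The upper bound follows from the first window alone: the chosen words must 
all differ in their first k bits, and there are only 2^k such prefixes. 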

Secondary structure 

Since all words are derived from the same template, in the absence of shifts, 
the number of mismatches can be the minimum distance of the error-correcting 
code words. Hybridization is therefore more likely to proceed without shifts. To 
avoid secondary structures, the minimum distance of the error-correcting code 
words is kept sufficiently large and base mismatches are distributed as evenly 
as possible. The latter constraint is already achieved by imposing the length-$k$ 
subword constraint. 
5 Results 

5.1 DNA Code for the English Alphabet 

Consider the design for the English alphabet using DNA. For each letter, one 
DNA word is required. One short error-correcting code is the nonlinear (12,144,4) 
code [26].⁴ Using a Max Clique Problem solver,⁵ 32, 56, and 104 words could 
be chosen that satisfied the length-6, -7, and -8 subword constraints, respectively. 

There are 74 template words of length 12 and of minimum distance 4; they are 
shown in the Appendix. Since 128 words cannot be derived from a single template 
under the subword constraint, two words, say $S$ and $T$, were selected from the 
74 templates such that both $S$ and $T$ induce more than 3 mismatches with any 
concatenation of the 4 words $T$, $S$, $T^R$, and $S^R$ (16 patterns), and each 
chosen word shares no subword longer than 5, 6, or 7 bases with the other and 
with their concatenations. Under the length-6 subword constraint, no template 
pair could satisfy all constraints. Under the length-7 and length-8 subword 
constraints, 8 pairs were selected. (See the Appendix.) All pairs had a common 
dimer frequency. Under this condition, DNA words derived from these templates 
can be shown to share close melting temperatures. 

Thus, we found 2 templates could be used simultaneously in the design of 
length-12 words. There were 8 candidate pairs. By combining one of 8 pairs 
with the 56 words in the Appendix, 112 words were obtained that satisfied the 
following conditions: 

— They mismatched in at least 4 positions between any pair of words and their 
complements. 

— The 4 mismatches were guaranteed under any shift and concatenation with 
themselves and their complements (comma-freeness of index 4). 

— None shared a subword of length 7 or longer under any shift and concatena- 
tion. 

— All words had close melting temperatures in the nearest neighbor approxi- 
mation. 

— Because all words were derived from only two templates, the occurrence of 
specific subsequences could be easily located. In addition, the avoidance of 
specific subsequences was also easy. 

We consider that the 112 words can serve as a standard code for the English 
alphabet. The number of words, 112, falls short of the 128 ASCII characters. 
However, some characters are usually unused. For example, HTML characters 
from &#14 to &#31 are not used. Therefore, the 112 words suffice for the DNA 
ASCII code. This is preferable to loosening the constraints to obtain 128 words. 

⁴ The notation (12,144,4) reads ‘a length-12 code of 144 words with the minimum 
distance 4’ (one error-correcting). 

⁵ http://rtm.science.unitn.it/intertools/ 







Masanori Arita 



5.2 Discussion 

The current status of information-encoding models was reviewed and the neces- 
sity and difficulty of constructing comma-free DNA code words was discussed. 
The proposed design method can provide 112 DNA words of length 12 and 
comma-free index 4. This result is superior to the current standard because it 
is the only work that considers arbitrary concatenation among words including 
their complementary strands. 

In analyzing the encoding models, error and efficiency must be clearly distin- 
guished. Error refers to the impairment of encoded information due to experi- 
mental missteps such as unexpected polymerization or excision. Efficiency refers 
to the processing speed, not the accuracy, of experiments. 

Viewed in this light, the proposed DNA code effectively minimizes errors: 
First, unexpected polymerization does not occur because all words satisfy 
the length-7 subword constraint. Second, the site of possible excision under 
the application of enzymes is easily identified. Lastly, all words have uniform 
physico-chemical properties and their interaction is expected to be in concert. 
The efficiency, on the other hand, remains to be improved. It can be argued that 
4 mismatches for words of length 12 are insufficient for avoiding unexpected 
secondary structures. Indispensable laboratory experiments are underway and 
confirmation of the applicability of the code presented here to any of the encoding 
models is awaited. 

Regarding code size, it is likely that the number of words can be increased 
by a stochastic search. 

Without systematic construction, however, the resulting code loses one good 
property, i.e., the easy location of specific subsequences under any concatenation. 

The error-correcting comma-free property of the current DNA words opens 
a way to new biotechnologies. Important challenges include: 1. The design of a 
comma-free quaternary code of high indices; 2. Analysis of the distribution of 
mismatches in error-correcting code words; and 3. The development of criteria 
to avoid the formation of secondary structures. 

Also important is the development of an experimental means to realize ‘DNA 
signature’. Its presence may forestall and resolve lawsuits on the copyright of en- 
gineered genomes. Currently when a DNA message is introduced into a genome, 
no convenient method exists for the detection of its presence unless the mes- 
sage sequence is known. In the future, it should be possible to include English 
messages, not ACGTs, on the input window of DNA synthesizers. 
Appendix 

110010100000  110001010000† 110000001010  110000000101  101100100000† 101001001000† 
101000010001  101000000110† 100101000100† 100100011000  100100000011  100011000010 
100010010100  100010001001  100001100001† 100000110010  100000101100† 011100000010 
011010000100  011000110000† 011000001001  010110001000  010100100100  010100010001 
010011000001  010010010010  010001101000  010001000110  010000100011† 010000011100 
001110010000† 001101000001† 001100001100  001010101000† 001010000011  001001100010 
001001010100† 001000100101  001000011010† 000110100010  000110000101  000101110000† 
000101001010  000100101001† 000100010110  000011100100  000011011000  000010110001† 
000010001110  000001010011  000001001101† 001101011111  001110101111  001111110101 
001111111010  010011011111† 010110110111† 010111101110† 010111111001  011010111011† 
011011100111  011011111100  011100111101† 011101101011  011101110110  011110011110† 
011111001101  011111010011† 100011111101† 100101111011  100111001111† 100111110110 
101001110111  101011011011  101011101110  101100111110  101101101101  101110010111 
101110111001  101111011100† 101111100011  110001101111  110010111110† 110011110011† 
110101010111  110101111100† 110110011101  110110101011† 110111011010† 110111100101 
111001011101  111001111010  111010001111  111010110101  111011010110† 111011101001 
111100011011  111100100111  111101001110† 111101110001  111110101100  111110110010† 
000000000000† 111111111111† 000000111111  000011101011† 000101100111  000110011011† 
000110111100  001001111001  001010011101  001010110110  001100110011† 001111000110† 
010001110101† 010010101101† 010100001111† 010100111010  010111010100  011000010111 
011000101110  011011001010† 011101011000† 011110100001  111111000000  111100010100† 
111010011000† 111001100100  111001000011† 110110000110  110101100010  110101001001 
110011001100  110000111001† 101110001010† 101101010010† 101011110000  101011000101† 
101000101011  100111101000  100111010001  100100110101† 100010100111† 100001011110 

(12,144,4) Code. Daggers indicate 56 words that satisfy the length-7-subword 
constraint. 
101001100000 011001010000 101101110000 101100001000 011101101000 110011101000 
001010011000 101110011000 111001011000 010110111000 001101000100 011101100100 
001111010100 001110110100 111010001100 110010101100 101111000010 111001100010 
010111100010 111100010010 011000001010 011010100110 100001110110 100100011110 
111010010001 110110010001 100110101001 101110000101 111000100101 110101000011 
110100100011 

Templates of Length 12. When their reversals and 01-flips are included, the total 
number of words is 74. 



000110011101 and 001010111100 000110011101 and 001111010100 
001010111100 and 101110011000 001111010100 and 101110011000 
010001100111 and 110000101011 010001100111 and 110101000011 
110000101011 and 111001100010 110101000011 and 111001100010 

Template Pairs Satisfying Minimum Distance 4 and Length-7-subword Constraint. 
Abstract. We propose a natural computational model called a balance 
machine. The computational model consists of components that resemble 
ordinary physical balances, each with an intrinsic property to automatically 
balance the weights on their left and right pans. If we start with certain 
fixed weights (representing inputs) on some of the pans, then the balance-like 
components would vigorously try to balance themselves by filling the 
rest of the pans with suitable weights (representing the outputs). This 
balancing act can be viewed as a computation. We will show that the 
model allows us to construct those primitive (hardware) components that 
serve as the building blocks of a general purpose (universal) digital com- 
puter: logic gates, memory cells (flip-flops), and transmission lines. One 
of the key features of the balance machine is its “bidirectional” operation: 
both a function and its (partial) inverse can be computed spontaneously 
using the same machine. Some practical applications of the model are 
discussed. 



1 Computing as a “Balancing Feat” 

A detailed account of the proposed model will be given in Section 2. In this 
section, we simply convey the essence of the model without delving into technical 
details. The computational model consists of components that resemble ordinary 
physical balances (see Figure 1), each with an intrinsic property to automatically 
balance the weights on their left and right pans. In other words, if we start with 
certain fixed weights (representing inputs) on some of the pans, then the balance- 
like components would vigorously try to balance themselves by filling the rest of 
the pans with suitable weights (representing the outputs) . Roughly speaking, the 
proposed machine has a natural ability to load itself with (output) weights that 
“balance” the input. This balancing act can be viewed as a computation. There is 
just one rule that drives the whole computational process: the left and right pans 
of the individual balances should be made equal. Note that the machine is designed 
in such a way that the balancing act would happen automatically by virtue of 
physical laws (i.e., the machine is self-regulating).¹ One of our aims is to show 
that all computations can be ultimately expressed using one primitive operation: 
balancing. Armed with the computing = balancing intuition, we can see basic 
computational operations in a different light. In fact, an important result of 
this paper is that this sort of intuition suffices to conceptualize/implement any 
computation performed by a conventional computer. 

¹ If the machine cannot (eventually) balance itself, it means that the particular 
instance does not have a solution. 




Fig. 1. Physical balance. 



The rest of the paper is organized as follows: Section 2 gives a brief intro- 
duction to the proposed computational model; Section 3 discusses a variety of 
examples showing how the model can be made to do basic computations; Sec- 
tion 4 is a brief note on the universality feature of the model; Section 5 reviews 
the notion of bilateral computing and discusses an application (solving the classic 
SAT problem); Section 6 concludes the paper. 

2 The Balance Machine Model 

At the core of the proposed natural computational model are components that 
resemble a physical balance. In ancient times, the shopkeeper at a grocery store 
would place a standard weight in the left pan and would try to load the right pan 
with a commodity whose weight equals that on the left pan, typically through 
repeated attempts. The physical balance of our model, though, has an intrinsic 
self-regulating mechanism: it can automatically load (without human interven- 
tion) the right pan with an object whose weight equals the one on the left pan. 
See Figure 2 for a possible implementation of the self-regulating mechanism. 

In general, unlike the one in Figure 2, a balance machine may have more than 
just two pans. There are two types of pans: pans carrying fixed weights which re- 
main unaltered by the computation and pans with variable (fluid) weights that 
are changed by activating the filler-spiller outlets. Some of the fixed weights 
represent the inputs, and some of the variable ones represent outputs. The in- 
puts and outputs of balance machines are, by default, non-negative reals unless 
stated otherwise. The following steps comprise a typical computation by a given 
balance machine: 
Fig. 2. A self-regulating balance. The source is assumed to have (an arbitrary amount 
of) a fluid-like substance. When activated, the filler outlet lets fluid from the source into the 
right pan; the spiller outlet, on being activated, allows the right pan to spill some of its 
content. The balancing is achieved by the following mechanism: the spiller is activated 
if at some point the right pan becomes heavier than the left (i.e., when push button (2) 
is pressed) to spill off the extra fluid; similarly, the filler is activated to add extra 
fluid to the right pan just when the left pan becomes heavier than the right (i.e., when 
push button (1) is pressed). Thus, the balance machine can “add” (sum up) inputs x 
and y by balancing them with a suitable weight on its right: after being loaded with 
inputs, the pans would go up and down till the machine eventually finds itself balanced. 



— Plug in the desired inputs by loading weights to pans (pans with variable 
weights can be left empty or assigned with arbitrary weights). This defines 
the initial configuration of the machine. 

— Allow the machine to balance itself: the machine automatically adjusts the 
variable weights till the weights on the left and right pans of the balance(s) become equal. 

— Read output: weigh fluid collected in the output pans (say, with a standard 
weighing instrument). 
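
The three steps above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `settle`, the fixed drop size and the step limit are hypothetical and not part of the model, which leaves the physical realization open.

```python
def settle(fixed_left, drop=0.001, max_steps=1_000_000):
    """Fill or spill the right (variable) pan until it balances the left pan."""
    target = sum(fixed_left)          # step 1: inputs loaded on the left pan
    right = 0.0                       # the variable pan starts empty
    for _ in range(max_steps):        # step 2: the machine balances itself
        if abs(target - right) < drop:
            return right              # step 3: read the output pan
        if right < target:
            right += drop             # push button (1): filler adds fluid
        else:
            right -= drop             # push button (2): spiller removes fluid
    raise RuntimeError("balance did not settle")


total = settle([2.0, 3.0])            # settles near 5.0, i.e. x + y
```

The result is exact only up to one drop, which mirrors the analog nature of the machine.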

See Figure 3 for a schematic representation of a machine that adds two quantities. 
We wish to point out that, to express computations more complex than addition, 
we would require a combination of balance machines such as the one in Figure 2. 
Section 3 gives examples of more complicated machines. 



3 Computing with Balance Machines: Examples 

In what follows, we give examples of a variety of balance machines that carry out 
a wide range of computing tasks — from the most basic arithmetic operations to 
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A 



Legend: a “bar” means that the weights on both of its sides should balance; 
a group of two (or more) pans means that their weights add up (these weights 
need not balance each other); small letters and numerals represent fixed weights; 
capital letters represent variable weights. 

Fig. 3. Schematic representation of a simple balance machine that performs addition. 



solving linear simultaneous equations. Balance machines that perform the oper- 
ations increment, decrement, addition, and subtraction are shown in Figures 4, 5, 
6, and 7, respectively. Legends accompanying the figures give details regarding 
how they work. 

Balance machines that perform multiplication by 2 and division by 2 are 
shown in Figures 8, 9, respectively. Note that in these machines, one of the 
weights (or pans) takes the form of a balance machine.^ This demonstrates that 
such a recursive arrangement is possible. 

We now introduce another technique of constructing a balance machine: hav- 
ing a “common” weight shared by more than one machine. Another way of visual- 
izing the same situation is to think of pans (belonging to two different machines) 
being placed one over the other. We use this idea to solve a simple instance of 
linear simultaneous equations. See Figures 10 and 11 which are self-explanatory. 
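
A minimal numerical sketch of two balances sharing variable pans follows. It assumes the instance x + y = 8, x - y = 2 (as far as the text accompanying Figure 11 can be read) and replaces the physical filler-spiller dynamics with a simple relaxation rule; the function name `settle_system`, the rate and the step count are illustrative assumptions.

```python
def settle_system(balances, names, rate=0.1, steps=500):
    """Relax shared variable pans until every balance is (nearly) level.

    balances: list of (coeffs, total), meaning sum(coeffs[n] * value[n]) == total.
    """
    values = {n: 0.0 for n in names}
    for _ in range(steps):
        # imbalance of every balance in the current configuration
        errors = [total - sum(c * values[n] for n, c in coeffs.items())
                  for coeffs, total in balances]
        # each shared pan is filled or spilled according to all balances it sits on
        for n in names:
            values[n] += rate * sum(coeffs.get(n, 0.0) * e
                                    for (coeffs, _), e in zip(balances, errors))
    return values


solution = settle_system(
    [({"X": 1, "Y": 1}, 8),     # x + y = 8
     ({"X": 1, "Y": -1}, 2)],   # x - y = 2, i.e. x balances y + 2
    ["X", "Y"],
)
```

The negative coefficient is a modelling shortcut for a pan that sits on the opposite side of the bar; the weights themselves stay non-negative.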

An important property of balance machines is that they are bilateral comput- 
ing devices. See [1], where we introduced the notion of bilateral computing. Typ- 
ically, bilateral computing devices can compute both a function and its (partial) 
inverse using the same mechanism. For instance, the machines that increment 
and decrement (see Figures 4 and 5) share exactly the same mechanism, except 

^ The weight contributed by a balance machine is assumed to be simply the sum of 
the individual weights on each of its pans. The weight of the bar and the other parts 
is not taken into account. 
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X 1 



Fig. 4. Increment operation. Here x represents the input; Z represents the output. 
The machine computes increment(x). Both x and ‘1’ are fixed weights clinging to the 
left side of the balance machine. The machine eventually loads into Z a suitable weight 
that would balance the combined weight of x and ‘1’. Thus, eventually Z = x + 1, i.e., 
Z represents increment(x). 




Fig. 5. Decrement operation. Here z represents the input; X represents the output. 
The machine computes decrement(z). The machine eventually loads into X a suitable 
weight, so that the combined weight of X and ‘1’ would balance z. Thus, eventually 
X + 1 = z, i.e., X represents decrement(z). 



for the fact that we have swapped the input and output pans. Also, compare 
machines that (i) add and subtract (see Figures 6 and 7) and (ii) multiply and 
divide by 2 (see Figures 8 and 9). 
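
The bilateral property can be illustrated with a single routine that solves one balance for whichever pan is left variable. This is a sketch only; the function name `balance` and the use of `None` to mark the variable pan are assumptions made for the example.

```python
def balance(left, right):
    """Solve one balance that has exactly one variable pan (marked None)."""
    unknown_left = left.count(None)
    unknown_right = right.count(None)
    if unknown_left + unknown_right != 1:
        raise ValueError("exactly one variable pan is expected")
    known_left = sum(w for w in left if w is not None)
    known_right = sum(w for w in right if w is not None)
    # the variable pan takes up whatever weight is missing on its side
    value = known_right - known_left if unknown_left else known_left - known_right
    if value < 0:
        raise ValueError("no non-negative weight balances this machine")
    return value


incremented = balance([4, 1], [None])    # x + 1 = Z, with x = 4
decremented = balance([None, 1], [5])    # X + 1 = z, with z = 5
```

The same code performs increment and decrement (or addition and subtraction); only the position of the variable pan changes.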

Though balance machines are basically analog computing machines, we can 
implement Boolean operations (AND, OR, NOT) using balance machines, pro- 
vided we make the following assumption: There are threshold units that return 
a desired value when the analog values (representing inputs and outputs) exceed 
a given threshold and some other value otherwise. See Figures 12, 13, and 14 
for balance machines that implement logical operations AND, OR, and NOT re- 
spectively. We represent true inputs with the (analog) value 10 and false inputs 
with 5; when the output exceeds a threshold value of 5 it is interpreted as true, 
and as false otherwise. (Instead, we could have used the analog values 5 and 0 
to represent true and false; but, this would force the AND gate’s output to a 
negative value for a certain input combination.) Tables 1, 2, and 3 give the truth 
tables (along with the actual input/output values of the balance machines). 
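
The gate equations stated in the captions of Figures 12 to 14 can be checked with a few lines of code. This is a sketch under assumptions: the constant 5 in the OR balance is read from a damaged caption, and the helper names are hypothetical.

```python
TRUE, FALSE, THRESHOLD = 10, 5, 5


def and_gate(x, y):
    return x + y - 10      # balance x + y = Z + 10


def or_gate(x, y):
    return x + y - 5       # balance x + y = Z + 5 (assumed constant)


def not_gate(x):
    return 15 - x          # balance x + Y = 15


def as_bool(weight):
    """Threshold unit: a weight above 5 is read as true."""
    return weight > THRESHOLD


rows = [(x, y, and_gate(x, y), or_gate(x, y))
        for x in (FALSE, TRUE) for y in (FALSE, TRUE)]
```

With true = 5 and false = 0 instead, `and_gate(0, 0)` would need a constant that drives the output below zero, which is the problem noted in the parenthetical remark above.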
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+ 




X 



y 



Fig. 6. Addition operation. The inputs are x and y; Z represents the output. The 
machine computes x + y. The machine loads into Z a suitable weight, so that the 
combined weight of x and y would balance Z. Thus, eventually x + y = Z, i.e., Z would 
represent x + y. 



Fig. 7. Subtraction operation. Here z and x represent the inputs; Y represents the 
output. The machine computes z - x. The machine loads into Y a suitable weight, so 
that the combined weight of x and Y would balance z. Thus, eventually x + Y = z, 
i.e., Y would represent z - x. 

4 Universality of Balance Machines 

The balance machine model is capable of universal discrete computation, in the 
sense that it can simulate the computation of a practical, general purpose digital 
computer. We can show that the model allows us to construct those primitive 
(hardware) components that serve as the “building blocks” of a digital computer: 
logic gates, memory cells (flip-flops) and transmission lines. 

1. Logic gates 

We can construct AND, OR, NOT gates using balance machines as shown in 
Section 3. Also, we could realize any given Boolean expression by connecting 
balance machines (primitives) together using “transmission lines” (discussed 
below). 

2. Memory cells 

The weighing pans in the balance machine model can be viewed as “storage” 
areas. Also, a standard S-R flip-flop can be constructed by cross-coupling 
two NOR gates, as shown in Figure 15. Table 4 gives its state table. The NOR gates 
can be implemented with balance machines by replacing the OR, NOT gates 
in the diagram with their balance machine equivalents in a straightforward 
manner. 
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Fig. 8. Multiplication by 2. Here a represents the input; A represents the output. The 
machine computes 2a. The combined weights of a and B should balance A: a + B = A; 
also, the individual weights a and B should balance each other: a = B. Therefore, 
eventually A will assume the weight 2a. 



A B 

Fig. 9. Division by 2. The input is a; A represents the output. The machine “exactly” 
computes a/2. The combined weights of A and B should balance a: A + B = a; also, the 
individual weights A and B should balance each other: A = B. Therefore, eventually 
A will assume the weight a/2. 



3. Transmission lines 

A balance machine like machine (2) of Figure 16 that does nothing but a 
“copy” operation (copying whatever is on the left pan to the right pan) would serve 
both as a transmission line and as a delay element in some contexts. 
(The pans have been drawn as flat surfaces in the diagram.) Note that the 
left pan (of machine (2)) is of the fixed type (with no filler-spiller outlets) 
and the right pan is a variable one. 
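
As an illustration of the memory cell in item 2, the sketch below builds a NOR gate from the OR and NOT balances of Section 3 and cross-couples two of them. It assumes the standard latch wiring Q = NOR(R, Q'), Q' = NOR(S, Q), a threshold unit that restores the levels 10 and 5, and the constant 5 in the OR balance; the function names are hypothetical.

```python
TRUE, FALSE = 10, 5


def restore(weight):
    """Threshold unit: map an analog weight back to the two logic levels."""
    return TRUE if weight > 5 else FALSE


def nor(x, y):
    z = restore(x + y - 5)     # OR balance:  x + y = Z + 5
    return restore(15 - z)     # NOT balance: z + Y = 15


def sr_step(s, r, q, q_bar, max_rounds=10):
    """Let the two cross-coupled NOR gates settle for inputs s and r."""
    for _ in range(max_rounds):
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            return q, q_bar
        q, q_bar = new_q, new_q_bar
    raise RuntimeError("flip-flop did not settle")


state = sr_step(TRUE, FALSE, FALSE, TRUE)   # set
held = sr_step(FALSE, FALSE, *state)        # hold the stored value
cleared = sr_step(FALSE, TRUE, *held)       # reset
```

The input combination S = R = true is left out because its result is undefined in the state table.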

Note that the computational power of Boolean circuits is equivalent to that 
of a Finite State Machine (FSM) with bounded number of computational steps 
(see Theorem 3.1.2 of [4]).^ But, balance machines are “machines with memory”: 
using them we can build not just Boolean circuits, but also memory elements 
(flip-flops). Thus, the power of balance machines surpasses that of mere bounded 
FSM computations; to be precise, they can simulate any general sequential cir- 
cuit. (A sequential circuit is a concrete machine constructed of gates and memory 
devices.) Since any finite state machine (with bounded or unbounded computa- 
tions) can be realized as a sequential circuit using standard procedures (see [4]), 
one can conclude that balance machines have (at least) the computing power of 
unbounded FSM computations. Given the fact that any practical (general pur- 
pose) digital computer with only limited memory can be modeled by an FSM, 
we can in principle construct such a computer using balance machines. Note, 
however, that notional machines like Turing machines are more general than 
balance machines. Nevertheless, standard “physics-like” models in the litera- 

^ Also, according to Theorem 5.1 of [2] and Theorem 3.9.1 of [4], a Boolean circuit 
can simulate any T-step Turing machine computation. 
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Fig. 10. Solving simultaneous linear equations. The constraints Xi = X 2 and Vi = Y 2 
will be taken care of by balance machines (3) and (4). Observe the sharing of pans 
between them. The individual machines work together as a single balance machine. 



ture like the Billiard Ball Model[3] are universal only in this limited sense: there 
is actually no feature analogous to the infinite tape of the Turing machine. 

5 Bilateral Computing 

There is a fundamental asymmetry in the way we normally compute: while we 
are able to design circuits that can multiply quickly, we have relatively limited 
success in factoring numbers; we have fast digital circuits that can “combine” 
digital data using AND/OR operations and realize Boolean expressions, yet no 
fast circuits that determine the truth value assignment satisfying a Boolean ex- 
pression. Why should computing be easy when done in one “direction”, and 
not so when done in the other “direction”? In other words, why should invert- 
ing certain functions be hard, while computing them is quite easy? It may be 
because our computations have been based on rudimentary operations (like addi- 
tion, multiplication, etc.) that force an explicit distinction between “combining” 
and “scrambling” data, i.e. computing and inverting a given function. On the 
other hand, a primitive operation like balancing does not do so. It is the same 
balance machine that does both addition and subtraction: all it has to do is 
to somehow balance the system by filling up the empty variable pan (repre- 
senting output); whether the empty pan is on the right (addition) or the left 
(subtraction) of the balance does not particularly concern the balance machine! 
In the bilateral scheme of computing, there is no need to develop two distinct 
intuitions — one for addition and another for subtraction; there is no dichotomy 
between functions and their (partial) inverses. Thus, a bilateral computing sys- 
tem is one which can implement a function as well as (one of) its (partial) 
inverse(s), using the same intrinsic “mechanism” or “structure”. See [1] where 
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x + |/ = 8; X — y = 2 




X Y Y 2 



Fig. 11. Solving simultaneous linear equations (easier representation). This is a sim- 
pler representation of the balance machine shown in Figure 10. Machines (3) and (4) 
are not shown; instead, we have used the same (shared) variables for machines (1) and 
(2). 




Fig. 12. AND logic operation. Here x and y are inputs to be ANDed; Z represents the 
output. The balance realizes the equality x + y = Z + 10. 



we have developed bilateral computing systems based on fluid mechanics and 
have given a mathematical characterization of such systems. 

We now show how the classic SAT problem can be solved under a bilateral 
computing scheme, using balance machines. For the time being, we make no 
claims regarding the time complexity of the approach since we have not analyzed 
the time characteristics of balance machines. However, we believe that it will not 
be exponential in terms of the number of variables (see also [1]). The main idea 
is this: first realize the given Boolean expression using gates made of balances; 
then, by setting the pan that represents the outcome of the Boolean expression 
to (the analog value representing) true, the balance machine can be made to 
automatically assume a set of values for its inputs that would “balance” it. In 
other words, by setting the output to be true, the inputs are forced to assume 
one of those possible truth assignments (if any) that generate a true output. The 
machine would never balance when there is no such assignment 
to the inputs (i.e., the formula is unsatisfiable). This is like operating a circuit 
realizing a Boolean expression in the “reverse direction”: assigning the “output” 
first, and making the circuit produce the appropriate “inputs”, rather than the 
other way round. 

See Figure 17 where we illustrate the solution of a simple instance of SAT 
using a digital version of the balance machine whose inputs/outputs are positive 



Balance Machines: Computing = Balancing 






Table 1. Truth table for AND. 

x            y            Z 
5 (false)    5 (false)    0 (false) 
5 (false)    10 (true)    5 (false) 
10 (true)    5 (false)    5 (false) 
10 (true)    10 (true)    10 (true) 



Table 2. Truth table for OR. 

x            y            Z 
5 (false)    5 (false)    5 (false) 
5 (false)    10 (true)    10 (true) 
10 (true)    5 (false)    10 (true) 
10 (true)    10 (true)    15 (true) 



Table 3. Truth table for NOT. 

x            Y 
5 (false)    10 (true) 
10 (true)    5 (false) 



Table 4. State table for S-R flip-flop. 

S    R    Q                      Q' 
0    0    previous state of Q    previous state of Q' 
0    1    0                      1 
1    0    1                      0 
1    1    undefined              undefined 



integers (as opposed to reals). Note that these machines work based on the 
following assumptions: 

1. Analog values 10 and 5 are used to represent true and false respectively. 

2. The filler-spiller outlets let out fluid only in (discrete) “drops”, each weighing 
5 units. 

3. The maximum weight a pan can hold is 10 units. 
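
Under the three assumptions above, the balanced configurations for the instance (a + b)(a' + b) can be enumerated directly. This sketch replaces the physical settling process by a brute-force search over discrete pan levels; the placement of the Extra pans on the right-hand side and the constant 5 in the OR balance are assumptions read from the figure captions, and the function name is hypothetical.

```python
from itertools import product

LEVELS = (0, 5, 10)   # drops of 5 units, at most 10 units per pan


def balanced_settings():
    """Enumerate pan settings that balance all three gates of (a + b)(a' + b)."""
    found = []
    for a, b, a_not, extra1, extra2 in product(LEVELS, repeat=5):
        gate1 = a + b == 10 + 5 + extra1        # OR gate (1), output fixed to 10
        gate2 = a_not + b == 10 + 5 + extra2    # OR gate (2), output fixed to 10
        gate3 = a + a_not == 15                 # NOT gate (3): a, a' complementary
        if gate1 and gate2 and gate3:
            found.append({"A": a, "B": b, "Extra1": extra1, "Extra2": extra2})
    return found


settings = balanced_settings()
```

The search says nothing about how long a physical machine would take to settle; it only confirms which configurations are balanced.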

6 Conclusions 

As said earlier, one of our aims has been to show that all computations can be 
ultimately expressed using one primitive operation: balancing. The main thrust 
of this paper is to introduce a natural intuition for computing by means of a 
generic model, rather than to give a detailed physical realization of the model. We have 
not analysed the time characteristics of the model, which might depend on how 
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Fig. 13. OR logic operation. Here x and y are inputs to be ORed; Z represents the 
output. The balance realizes the equality x + y = Z + 5. 







Fig. 14. NOT logic operation. Here x is the input to be negated; Y represents the 
output. The balance realizes the equality x + Y = 15. 






Fig. 15. S-R flip-flop constructed using cross coupled NOR gates. 



we ultimately implement the model. Also, apart from showing with illustrative 
examples various possible (valid) ways of constructing balance machines, we have 
not detailed a formal “syntax” that governs them. 

Finally, this note shows that one of the possible answers to the question 
“What does it mean to compute?” is: “To balance the inputs with suitable outputs 
(on a suitably designed balance machine).” 
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Fig. 16. Balance machine as a “transmission line”. Balance machine (2) acts as a 
transmission line feeding the output of machine (1) into the input of machine (3). 



Satisfiability of (a + b)(a' + b) 

Truth table 

a    b    (a + b)(a' + b) 
0    0    0 
0    1    1 
1    0    0 
1    1    1 



Fig. 17. Solving an instance of SAT: The satisfiability of the formula (a + b)(a' + b) 
is verified. Machines (1), (2) and (3) work together sharing the variables A, B and A' 
between them. OR gates (labeled 1 and 2) realize (a + b) and (a' + b) respectively and 
the NOT gate (labeled 3) ensures that a and a' are “complementary”. Note that the 
“outputs” of gates (1) and (2) are set to 10. Now, one has to observe the values eventually 
assumed by the variable weights A and B (that represent “inputs” of OR gate (1)). 
Given the assumptions already mentioned, one can easily verify that the machine will 
balance, assuming one of the two following settings: (i) A = 5, B = 10 (Extra1 = 0, 
Extra2 = 5) or (ii) A = 10, B = 10 (Extra1 = 5, Extra2 = 0). These are the only 
configurations that make the machine balanced. In situations when both the left pans 
of gate (1) assume 10, Extra1 will automatically assume 5 to balance off the extra 
weight on the left side. (Extra2 plays a similar role in gate (2).) 
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Abstract. A class of P systems, called EOP systems, with symbol ob- 
jects processed by evolution rules distributed alongside the transitions 
of an Eilenberg machine, is introduced. A parallel variant of EOP sys- 
tems, called EOPP systems, is also defined and the power of both EOP 
and EOPP systems is investigated in relationship with three parameters: 
number of membranes, states and set of distributed rules. It is proven 
that the family of Parikh sets of vectors of numbers generated by EOP 
systems with one membrane, one state and one single set of rules coin- 
cides with the family of Parikh sets of context-free languages. The hier- 
archy collapses when at least one membrane, two states and four sets of 
rules are used and in this case a characterization of the family of Parikh 
sets of vectors associated with ET0L languages is obtained. It is also 
shown that every EOP system may be simulated by an EOPP system 
and EOPP systems may be used for solving NP-complete problems. In 
particular linear time solutions are provided for the SAT problem. 



1 Introduction 

P systems were introduced by Gh. Paun [12] as a computational model in- 
spired by the structure and functioning of the cell. A central role in this con- 
text is played by membranes delimiting regions and allowing or preventing the 
transport of different molecules and chemicals among them. Different classes 
of P systems dealing with string objects or symbol objects, considering sets 
or multisets of elements leading to various families of languages were inves- 
tigated [13] (an up-to-date bibliography of the whole area may be found at 
http://psystems.disco.unimib.it/). Because rewriting alone even in the context 
of a highly parallel environment of a membrane structure is not enough to lead 
to characterizations of recursively enumerable languages, various other features 
have been considered, such as a priority relationship over the set of rules, permit- 
ting or forbidding conditions associated with rules, restrictions on the derivation 
mode, the possibility to control the membrane permeability [7] etc (for more de- 
tails see [13]). In general the most used priority relationship on the set of rewrit- 
ing rules is a partial order relationship, well studied in the context of generative 
mechanisms with restrictions in derivation [5]. 

In [1] the priority relationship was replaced by a transition diagram associated 
with an Eilenberg machine, giving birth to two classes of Eilenberg systems, 
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a sequential version and a parallel one, called EP systems and EPP systems, 
respectively. In both variants, each transition has a specific set of evolution rules 
acting upon the string objects contained in different regions of the membrane 
system. The power of both variants working with string objects was investigated 
as well as the suitability of EPP systems to solve hard problems. In this paper 
multisets of symbol objects are considered and the corresponding variants of Eilenberg 
P systems are called EOP systems and EOPP systems. The definition and the 
behaviour of EOP and EOPP systems are very similar to those of EP and EPP 
systems, respectively. More precisely, the system will start in a given state and 
with an initial set of symbol objects. Given a state and a current multiset of 
symbol objects, in the case of EOP systems, the machine will evolve by applying 
rules associated with one of the transitions going out from the current state. 
The system will resume from the destination state of the current transition. In 
the parallel variant, instead of one state and a single multiset of symbol objects 
we may have a number of states, called active states, that are able to trigger 
outgoing transitions and such that each state hosts a different multiset of symbol 
objects; all the transitions emerging from every active state may be triggered 
once the rules associated with them may be applied; then the system will resume 
from the next states, which then become active states. EOP systems are mod- 
els of cells evolving under various conditions when certain factors may inhibit 
some evolution rules or some catalysts may activate other rules. Both variants 
dealing with string objects and symbol objects have some similarities with the 
grammar systems controlled by graphs [4], replacing a one-level structure, which 
is the current sentential form, with a hierarchical structure defined by the mem- 
brane system. On the other hand, these variants of P systems may be viewed 
as Eilenberg machines [6] having sets of evolution rules as basic processing re- 
lationships. EP and EOP systems share some similar behaviour with Eilenberg 
machines based on distributed grammar systems [8]. 

Eilenberg machines, generally known under the name of X machines [6], were 
initially used as a software specification language [9] and were later studied 
intensively in connection with software testing [10]. Communicating X-machine 
systems were also considered [2] as a model of parallel and communicating processes. 

In this paper the power of EOP and EOPP systems is investigated in 
connection with three parameters: number of membranes, states and sets of distributed 
rules. It is proven that the family of Parikh sets of vectors of numbers 
generated by EOP systems with one membrane, one state and one single set 
of rules coincides with the family of Parikh sets of context-free languages. The 
hierarchy collapses when at least one membrane, two states and four sets of 
rules are used and in this case a characterization of the family of Parikh sets of 
vectors associated with ET0L languages is obtained. It is also shown that every 
EOP system may be simulated by an EOPP system and EOPP systems may 
be used for solving NP-complete problems. In particular linear time solutions 
are provided for the SAT problem. The last result relies heavily on similarities 
between EOPP systems and P systems with replicated rewriting [11] and EPP 
systems [1]. 
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2 Definitions 

Definition 1. A stream Eilenberg machine is a tuple 

$X = (\Sigma, \Gamma, Q, M, \Phi, F, I, T, m_0)$, 

where: 

— $\Sigma$ and $\Gamma$ are finite sets called the input and the output alphabets, respectively; 

— $Q$ is the finite set of states; 

— $M$ is a (possibly infinite) set of memory symbols; 

— $\Phi$ is a set of basic partial relations on $\Sigma \times M \times M \times \Gamma^*$; 

— $F$ is the next state function $F : Q \times \Phi \to 2^Q$; 

— $I$ and $T$ are the sets of initial and final states; 

— $m_0$ is the initial memory value. 



Definition 2. An EOP system is a construct $\Pi = (\mu, X)$, where $\mu$ is a 
membrane structure consisting of $m$ membranes, with the membranes and the regions 
labelled in a one to one manner with the elements $1, \ldots, m$, and $X$ is an Eilenberg 
machine whose memory is defined by the regions of $\mu$. The Eilenberg 
machine is a system 

$X = (V, Q, M_1, \ldots, M_m, \Phi, F, I)$ 

having the following properties: 

— $V$ is the alphabet of the system; 

— $Q$, $F$ are as in Definition 1; 

— $M_1, \ldots, M_m$ are finite multisets over $V$ and represent the initial values 
occurring in the regions $1, \ldots, m$ of the system; 

— $\Phi = \{\phi_1, \ldots, \phi_p\}$, $\phi_i = (R_{i,1}, \ldots, R_{i,m})$, $1 \leq i \leq p$, and $R_{i,j}$ is a set of 
evolution rules (possibly empty) associated with region $j$, of the form 
$X \to (u_1, tar_1) \ldots (u_h, tar_h)$, with $X$ a multiset over $V$, $u_i \in V$, 
$tar_i \in \{here, out, in\}$, $1 \leq i \leq h$; the indication $here$ will be omitted; 

— $I = \{q_0\}$, $q_0 \in Q$ is the initial state; all the states are final states (equivalent 
to $Q = T$). 

It may be observed that the set $\Sigma$ and $m_0$ from Definition 1 are no longer used 
in the context of EOP systems. In fact, these concepts have been replaced by $V$ 
and $M_1, \ldots, M_m$, respectively (all the symbols are output symbols, $\Gamma = V$). 

A P system has $m$ sets of evolution rules, each one associated with a region. 
An EOP system has the evolution rules distributed among $p$ components $\phi_i$, 
$1 \leq i \leq p$, each one containing $m$ sets of evolution rules. 

A computation in $\Pi$ is defined as follows: it starts from the initial state $q_0$ 
and an initial configuration of the memory defined by $M_1, \ldots, M_m$ and proceeds 
iteratively by applying in parallel rules in all regions, processing in each one 
all symbol objects that can be processed; in a given state $q$, each multiset that 
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coincides with the left hand side of a rule is processed and the results are then 
distributed to various regions according to the target indications of that rule 
(for instance, when rewriting $X$ by a rule $X \to (u_1, tar_1) \ldots (u_h, tar_h)$, the 
components of the multiset $u_1 \ldots u_h$ obtained will be sent to the regions indicated 
by $tar_i$, $1 \leq i \leq h$, with the usual meaning in P systems (see [3], [13], [7])); the 
rules are from a component $\phi_i$ which is associated with one of the transitions 
emerging from the current state $q$, and the resulting symbols constitute the new 
configuration of the membrane structure with the associated regions; the next 
state, belonging to $F(q, \phi_i)$, will be the target state of the selected transition. 
The result represents the number of symbols that are collected outside of the 
system at the end of a halting computation. 
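
The computation just described can be sketched for the simplest case of one membrane and non-cooperative rules. The sketch makes several simplifying assumptions that are not in the definition: each component holds at most one rule per symbol, each pair (state, component) has a single target state, and the sequence of transitions is supplied by the caller as `schedule`. All names are hypothetical.

```python
from collections import Counter


def apply_component(region, component):
    """Rewrite every object that has a rule in `component` (maximal parallelism).

    `component` maps a symbol to a list of (symbol, target) pairs, with target
    "here" or "out". Returns (new_region, sent_out), or None if no rule applies.
    """
    if not any(region[s] > 0 for s in component):
        return None
    new_region, sent_out = Counter(), Counter()
    for symbol, count in region.items():
        if symbol not in component:
            new_region[symbol] += count          # untouched objects stay in place
            continue
        for produced, target in component[symbol]:
            destination = sent_out if target == "out" else new_region
            destination[produced] += count
    return new_region, sent_out


def run(region, components, next_state, state, schedule):
    """Follow the transitions named in `schedule`, collecting what leaves the system."""
    region, output = Counter(region), Counter()
    for name in schedule:
        if (state, name) not in next_state:
            raise ValueError(f"no transition {name} from state {state}")
        step = apply_component(region, components[name])
        if step is None:
            raise ValueError(f"component {name} has no applicable rule")
        region, sent_out = step
        output += sent_out
        state = next_state[(state, name)]
    return state, region, output


components = {
    "phi1": {"a": [("a", "here"), ("a", "here")]},   # a -> a a
    "phi2": {"a": [("a", "out")]},                   # a -> (a, out)
}
next_state = {("q0", "phi1"): "q0", ("q0", "phi2"): "q1"}
state, region, output = run({"a": 1}, components, next_state,
                            "q0", ["phi1", "phi1", "phi2"])
```

In this run the system halts in state q1 with four copies of a outside the skin membrane, so the computed number is 4.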

EOPP systems have the same underlying construct $(\mu, X)$, with the only 
difference that instead of one single membrane structure, it deals with a set of 
instances having the same organization ($\mu$), but being distributed across the 
system. More precisely, these instances are associated with states called active 
states; these instances can divide up giving birth to more instances or collide 
into single elements depending on the current configuration of the active states 
and the general topology of the underlying machine. Initially only $q_0$ is an active 
state and the membrane configuration associated with $q_0$ is $M_1, \ldots, M_m$. All 
active states are processed in parallel in one step: all emerging transitions from 
these states are processed in parallel (and every single transition processes in 
parallel each string object in each region, if evolution rules match them). 

Cell division: if $q_j$ is one of the active states, $M_{j,1}, \ldots, M_{j,m}$ is its 
associated membrane configuration instance, and $\phi_{j,1}, \ldots, \phi_{j,t}$ are $\Phi$'s components 
associated with the emerging transitions from $q_j$, then the rules occurring in 
$\phi_{j,i}$, $1 \leq i \leq t$, are applied to the symbol objects from $M_{j,1}, \ldots, M_{j,m}$, the 
control passes onto $q_{j,1}, \ldots, q_{j,t}$, which are the target states of the transitions 
earlier nominated, with $M_{j,1,1}, \ldots, M_{j,m,1}, \ldots, M_{j,1,t}, \ldots, M_{j,m,t}$ their 
associated membrane configuration instances, obtained from $M_{j,1}, \ldots, M_{j,m}$ by 
applying rules of $\phi_{j,1}, \ldots, \phi_{j,t}$; the target states become active states, $q_j$ is 
deactivated and $M_{j,1}, \ldots, M_{j,m}$ vanish. Only $\phi_{j,i}$ components that have rules 
matching the symbol objects of $M_{j,1}, \ldots, M_{j,m}$ are triggered and consequently 
only their target states become active and associated with memory instances 
$M_{j,1,i}, \ldots, M_{j,m,i}$. If none of $\phi_{j,i}$ is triggered, then in the next step $q_j$ is 
deactivated and $M_{j,1}, \ldots, M_{j,m}$ vanish too. If some of $\phi_{j,i}$ are indicating the same 
component of $\Phi$, then the corresponding memory configurations $M_{j,1,i}, \ldots, M_{j,m,i}$ 
are the same as well; this means that identical transitions emerging from 
a state always yield the same result. 

Cell collision: if $\phi_1, \ldots, \phi_t$ enter the same state $r$ and some or all of them 
emerge from active states, then the result associated with $r$ is the union of 
membrane instances produced by those $\phi_i$'s emerging from active states and 
matching string objects from their membrane instances. 

A computation of an EOP (EOPP) system halts when none of the rules 
associated with the transitions emerging from the current states (active states) 
may be applied. 
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If n is an EOF system, then Ps{U) will denote the set of Parikh vectors 
of natural numbers computed as result of halting computations. The result of 
a halting computation in $\Pi$ is the Parikh vector $\Psi_V(w) = (|w|_{a_1}, \ldots, |w|_{a_n})$, 
where $w$ is the multiset formed by all the objects sent out of the system during 
the computation, with $|w|_{a_i}$ denoting the number of $a_i$'s occurring in $w$ and 
$V = \{a_1, \ldots, a_n\}$. 

The family of Parikh sets of vectors generated by EOP (EOPP) systems 
with at most $m$ membranes, at most $s$ states and using at most $p$ sets of rules 
is denoted by $PsEOP_{m,s,p}$ ($PsEOPP_{m,s,p}$). If one of these parameters is not 
bounded, then the corresponding subscript is replaced by $*$. The family of sets 
of vectors computed by P systems with non-cooperative rules in the basic model 
is denoted by $PsOP(ncoo)$ [13]. 

In what follows, we need the notion of an ET0L system, which is a construct 
$G = (V, T, w, P_1, \ldots, P_m)$, $m \geq 1$, where $V$ is an alphabet, $T \subseteq V$, $w \in V^*$, and 
$P_i$, $1 \leq i \leq m$, are finite sets (tables) of context-free rules over $V$ of the 
form $a \to x$. In a derivation step, all the symbols present in the current sentential 
form are rewritten using one table. The language generated by $G$, denoted by 
$L(G)$, consists of all strings over $T$ which can be generated in this way, 
starting from $w$. An ET0L system with only one table is called an E0L system. 
We denote by E0L and ET0L the families of languages generated by E0L systems 
and ET0L systems, respectively. Furthermore, we denote by PsE0L and 
PsET0L the families of Parikh sets of vectors associated with languages in E0L 
and ET0L, respectively. It is known that $PsCF \subset PsE0L \subset PsET0L \subset PsCS$. 
Details can be found in [14]. 
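
A derivation step of an ET0L system can be sketched as follows. The example system, with tables named `doubling` and `marking`, is hypothetical, and each table is assumed to be deterministic (one rule per symbol), whereas a table may in general offer several rules for the same symbol.

```python
from collections import Counter


def derive(word, tables):
    """Rewrite all symbols of `word` in parallel, using one table per step."""
    for table in tables:
        word = "".join(table[symbol] for symbol in word)
    return word


def parikh(word, alphabet):
    """Parikh vector of `word` with respect to the ordered `alphabet`."""
    counts = Counter(word)
    return tuple(counts[a] for a in alphabet)


doubling = {"a": "aa", "b": "b"}    # table P1: a -> aa, b -> b
marking = {"a": "ab", "b": "b"}     # table P2: a -> ab, b -> b
word = derive("a", [doubling, doubling, marking])
vector = parikh(word, "ab")
```

Every symbol is rewritten in each step, which is the feature that distinguishes these systems from context-free grammars.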

3 Computational Power of EP Systems and EPP Systems 

We start by presenting some preliminary results concerning the hierarchy on the 
number of membranes and on the number of states. 

Lemma 1. (i) $PsEOP_{1,1,1} = PsOP_1(ncoo) = PsCF$, 

(ii) $PsEOP_{*,*,*} = PsEOP_{1,*,*}$, 

(iii) $PsEOP_{1,*,*} = PsEOP_{1,2,*}$. 

Proof. (i) EP systems with one membrane, one state and one set of rules are 
equivalent to P systems with non-cooperative rules in the basic model. 

(ii) The hierarchy on the number of membranes collapses at level 1. The 
inclusion $PsEOP_{1,*,*} \subseteq PsEOP_{*,*,*}$ is obvious. For the opposite inclusion, 
the construction is nearly the same as the one provided in [13] for the basic 
model of P systems. We associate with each symbol an index that represents the 
membrane where this object is placed and, when we move an object from one 
membrane to another, we just change the corresponding index. 

(iii) The hierarchy on the number of states collapses at level 2. The inclusion 
$PsEOP_{1,2,*} \subseteq PsEOP_{1,*,*}$ is obvious. On the other hand, consider an EP 
system $\Pi$, with $Ps(\Pi) \in PsEOP_{1,s,p}$ for some $s \geq 3$, $p \geq 1$ (yet again, the 
cases $s = 1$ or $s = 2$ are not interesting at all), such that: 
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where X = (V, Q, F, qg), with , cj)p}. We construct an EP 

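system $\Pi'$, with $Ps(\Pi') \in PsEOP_{1,2,p+2}$ such that:

$$\Pi' = ([_1\,]_1, X'),$$

where

$$X' = ((V \cup Q \cup \{\#, \varepsilon\}), \{q, q'\}, M'_1, \Phi', F', q),$$

with $q, q', \#, \varepsilon \notin (V \cup Q)$ (i.e., $q, q', \#, \varepsilon$ are new symbols that appear
neither in $V$ nor in $Q$),

$$M'_1 = M_1 \cup \{q_0\}, \quad \Phi' = \{\phi'_1, \ldots, \phi'_p, \phi_{p+1}, \phi_{p+2}\}.$$

For each $1 \le j \le p$, we have:

$$\phi'_j = (\phi_j \cup \{p \to q \mid F(\phi_j, p) = q\} \cup \{p \to \# \mid p \in Q\} \cup \{\# \to \#\}),$$

and $F'(\phi'_j, q) = q$. Moreover, we have:

$$\phi_{p+1} = (\{p \to \varepsilon \mid p \in Q\}),$$

$$\phi_{p+2} = (\{a \to \# \mid a \to v \in \phi_j, 1 \le j \le p\} \cup \{\# \to \#\}),$$

and $F'(\phi_{p+1}, q) = q'$, $F'(\phi_{p+2}, q') = q'$.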

We have placed inside the skin membrane the initial state of the system $\Pi$.
In general, we may suppose to have inside membrane 1 an object $p$ that
represents the current state of the state machine associated with the system
$\Pi$. Thus, we apply the rules as in the system $\Pi$, by using some $\phi'_j$, and we
change the state by using the rule $p \to q$, if $F(\phi_j, p) = q$. At any moment, if we
choose the wrong set of rules with respect to the current state (i.e., there
does not exist any state $q$ such that $F(\phi_j, p) = q$), then we are forced to
apply a rule $p \to \#$ and, due to the rule $\# \to \#$, we generate an infinite
computation. In order to finish a computation, we have to trigger $\phi_{p+1}$,
which replaces the current state with $\varepsilon$ and leads the system to the state $q'$.
Here, if there are rules that can still be applied to the objects present inside,
then an infinite computation is generated, as we can continue to use the rules
of $\phi_{p+2}$ inside membrane 1 forever.

It follows that $Ps(\Pi') = Ps(\Pi)$. □

Now, we are able to show the main result concerning the power of EP sys- 
tems, which provides a characterization of the family of Parikh sets of vectors 
associated with ETOL languages. 

Theorem 1. $PsEOP_{1,2,*} = PsEOP_{1,2,4} = PsETOL$.

Proof. (i) $PsETOL \subseteq PsEOP_{1,2,4}$. According to Theorem 1.3 in [14], for each
language $L \in ETOL$ there is an ETOL system that generates $L$ and contains
only two tables, that is, $G = (V, T, w, P_1, P_2)$. Therefore, we can construct
an EP system

$$\Pi = ([_1\,]_1, X),$$

where

$$X = ((V \cup T \cup \{\#, f, f'\}), \{q, q'\}, M_1, \Phi, F, q),$$

with $q, q', \#, f, f' \notin (V \cup T)$ (i.e., $q, q', \#, f, f'$ are new symbols that do not
appear in $V$ or in $T$),

$$M_1 = w \cup \{f\}, \quad \Phi = \{\phi_1, \phi_2, \phi_3, \phi_4\}.$$

We have:

$$\phi_1 = (P_1),$$

$$\phi_2 = (P_2),$$

$$\phi_4 = (\{a \to \# \mid a \in (V - T)\} \cup \{a \to (a, out) \mid a \in T\} \cup \{\# \to \#\}),$$

and $F(\phi_1, q) = q$, $F(\phi_2, q) = q$, $F(\phi_3, q) = q'$, $F(\phi_4, q') = q'$.
The EP system $\Pi$ works as follows. We start in the state $q$; here, we can
use either $\phi_1$ or $\phi_2$ as many times as we want in order to simulate either the
application of the rules in $P_1$ or the application of the rules in $P_2$. At any
moment, we can decide to move from state $q$ to state $q'$ by triggering $\phi_3$. In $q'$,
we use $\phi_4$ in order to replace each non-terminal symbol with $\#$ and send out
of the system each terminal symbol. In this way, if membrane 1 contains only
terminal symbols, the computation halts successfully; otherwise we generate
an infinite computation that yields no result. Thus, it is easy to see that
$Ps(\Pi) = Ps(L(G))$. Furthermore, we have $Ps(\Pi) \in PsEOP_{1,2,4}$.
(ii) Consider an EP system $\Pi$, with $Ps(\Pi) \in PsEOP_{1,s,p}$ for some $s \ge 3$,
$p \ge 1$ (yet again, the cases $s = 1$ or $s = 2$ are not interesting at all), such
that

$$\Pi = ([_1\,]_1, X),$$

where $X = (V, Q, M_1, \Phi, F, q_0)$, with $\Phi = \{\phi_1, \ldots, \phi_p\}$. Thus, we construct
an ETOL system

$$G = ((V \cup Q \cup \{\bar{a} \mid a \in V\} \cup \{\#\}), V, \bar{M}_1 q_0, P_1, \ldots, P_p, P_{p+1}),$$

where $\bar{M}_1 q_0$ denotes a string containing symbols $\bar{a}$ for every $a \in M_1$ and, for
each $1 \le j \le p$, $P_j$ is a table that contains:

- a rule $\bar{a} \to v'$, for each rule $a \to v \in \phi_j$, with $v'$ a string obtained from $v$
as follows: it contains a symbol $b$ if $(b, out) \subseteq v$, a symbol $\bar{b}$ if $b \subseteq v$ and
there exists a rule $b \to u \in \phi_i$, $1 \le i \le p$, and $\lambda$ (i.e., no symbol) for
each $b \subseteq v$ such that there does not exist any rule $b \to u \in \phi_i$, $1 \le i \le p$;

- a rule $p \to q$, for each $p, q \in Q$ such that $F(\phi_j, p) = q$;

- a rule $p \to \#$, for each $p \in Q$.

Moreover, we have:

$$P_{p+1} = \{p \to \lambda \mid p \in Q\} \cup \{\bar{a} \to \#, a \to a \mid a \in V\}.$$

Now, it is easy to see that the ETOL system simulates the EP system
$\Pi$ correctly. We start with a string $\bar{M}_1 q_0$; we apply the tables $P_1, \ldots, P_p$
in order to simulate the application of the rules of $\Pi$ and the corresponding
transitions in the underlying state machine. At any moment, if we choose
the wrong table with respect to the current state, then we are forced
to introduce in the string the non-terminal $\#$, which cannot be replaced
by any rule anymore. Finally, we have the table $P_{p+1}$, which is used to simulate
the end of a successful computation in $\Pi$: if we use this table when there
still exist some rules that can be applied to the symbols present in the
current configuration, then we introduce the non-terminal $\#$, which cannot
be removed from the string anymore; otherwise, we get a terminal
string. □

As an EOL system is an ETOL system with only one table, we immediately
get the following result.

Corollary 1. $PsEOL \subseteq PsEOP_{1,2,3}$.

EPP systems exhibit a parallel behaviour not only inside of the membrane 
structure but also at the underlying machine level. Potentially, all transitions 
emerging from active states may be triggered in one step giving birth to new 
cells or colliding others. One problem addressed in this case is also related to the 
power of these mechanisms. In the sequel we will show that EPP systems are at 
least as powerful as EP systems. 

Theorem 2. If $\Pi$ is an EP system with $m$ membranes, $s$ states and $p$ sets of
rules then there exists an EPP system $\Pi'$ with $m' \ge m$ membranes, $s' \ge s$ states
and $p' \ge p$ sets of rules such that $Ps(\Pi) = Ps(\Pi')$.

Proof. Let $\Pi = (\mu, X)$ be an EP system where $\mu$ is a membrane structure
consisting of $m$ membranes, and $X$ an Eilenberg machine

$$X = (V, \Gamma, Q, M_1, \ldots, M_m, \Phi, F, I),$$

where $Q$ has $s$ states and $\Phi$ contains $p$ components. The following EPP system
is built: $\Pi' = (\mu, X')$, where



with 

- $V' = V \cup \{x\} \cup \{k \mid 1 \le k \le t\}$, where $t$ is the maximum number of transitions
going out from every state of $X$;

- $Q' = Q \cup \{q_{j,0} \mid q_j \in Q\} \cup \{q_{j,k} \mid q_j \in Q, 1 \le k \le t\}$;

- $M'_1 = M_1 \cup \{x\}$; $M'_j = M_j$, $2 \le j \le m$;

- $\Phi' = \Phi \cup \{\phi_x, \phi_{1x}, \ldots, \phi_{tx}\}$, where

  • $\phi_x = (\{x \to k \mid 1 \le k \le t\}, \emptyset, \ldots, \emptyset)$,

  • $\phi_{kx} = (\{k \to x\}, \emptyset, \ldots, \emptyset)$, $1 \le k \le t$;

— for any qj G Q if there are 1 < u < t, transitions emerging from qj and 

F{qj, 4>j^k) = 1 < fc < t6 (not all /pj^k are supposed to be distinct) then 

the following transitions are built in 7T7T : 

F F {qjfijpkx^ 1 ^ ^ w, 

A computation in $\Pi'$ proceeds as follows: at the beginning only the initial
state is active and the memory configuration in this state is $M'_1, \ldots, M'_m$. If the
EP system $\Pi$ is in a state $q_j$ and the memory configuration is $M_{j,1}, \ldots, M_{j,m}$,
then $\Pi'$ must be in $q_j$ as well. We will show that $\Pi'$ always has only one active
state. Indeed, if $q_j$ is an active state in $\Pi'$ and $M_{j,1}, \ldots, M_{j,m}$ is its associated
membrane configuration, then in one step $x$ from $M_{j,1}$ is changed by $\phi_x$ into $k$,
a value between 1 and $t$; if $u$ is the number of emerging transitions from $q_j$ in
$\Pi$, then $k > u$ implies that in the next step the current membrane configuration
will vanish as no more continuation is then allowed from $q_{j,0}$; otherwise, when
$1 \le k \le u$, only one transition may be triggered from $q_{j,0}$ and this is associated
with $\phi_{kx}$, which restores $x$ back into $M_{j,1}$ (the other transitions emerging from $q_{j,0}$
cannot be triggered). $\phi_{kx}$ leads the EPP system into $q_{j,k}$. From this state there
are two transitions both associated with $\phi_{j,k}$ that are triggered and consequently
$M_{j,1}, \ldots, M_{j,m}$ are processed yielding $M'_{j,1}, \ldots, M'_{j,m}$ and some symbol objects
may be sent out of the system. In every step only one state is active. In this way
$\Pi$ and $\Pi'$ compute the same objects, thus $Ps(\Pi) = Ps(\Pi')$. □

Note 1. From Theorem 2 it follows that if the EP system $\Pi$ has $m$ membranes,
$s$ states, $p$ components of $\Phi$ and the maximum number of transitions emerging
from every state is $t$ then the equivalent EPP system has $m' = m$ membranes,
at most $s' = (1 + t)s$ states, and at most $p' = p + t + 1$ sets of rules.

4 Linear Solution to SAT Problem 

SAT (satisfiability of propositional formulae in conjunctive normal form) is a
well-known NP-complete problem. This problem asks whether or not, for a given
formula in conjunctive normal form, there is a truth-assignment of the variables
such that it becomes true. So far, some methods to solve this problem in polynomial
or even linear time have been indicated, but at the expense of using an
exponential space of values. In [1] the problem has been solved in time linear in
the number of variables and the number of clauses by using EPP systems. It is
worth mentioning that in [1] the method used relied essentially on string objects
dealt with by EPP systems. In the sequel we will show that the SAT problem may
be solved in linear time by using EOPP systems.

Theorem 3. The SAT problem can be solved in a time linear in the number of
variables and the number of clauses by using an EOPP system.

Proof. Let $\gamma$ be a formula in conjunctive normal form with $m$ clauses, $C_1, \ldots, C_m$,
each one being a disjunction, and let the variables used be $x_1, \ldots, x_n$. The following
EOPP system, $\Pi = (\mu, X)$, may then be constructed:

$$\mu = [_1 [_2 \ldots [_{m+1}\,]_{m+1} \ldots ]_2 ]_1,$$

$$X = (V, Q, M_1, \ldots, M_{m+1}, \Phi, F, I),$$

where:

- $V = \{(i_1, \ldots, i_k) \mid i_j \in \{0, 1\}, 1 \le j \le k, 1 \le k \le n\}$;

- $Q = \{q\}$;

- $M_1 = \ldots = M_m = \emptyset$, $M_{m+1} = \{(0), (1)\}$;

- $\Phi = \{\phi_0, \phi_1, \phi_2\}$, where

  • $\phi_0 = (\emptyset, \ldots, \emptyset, \{(i_1, \ldots, i_k) \to (i_1, \ldots, i_k, 0) \mid i_j \in \{0, 1\}, 1 \le j \le k, 1 \le k \le n - 1\})$,

  • $\phi_1 = (\emptyset, \ldots, \emptyset, \{(i_1, \ldots, i_k) \to (i_1, \ldots, i_k, 1) \mid i_j \in \{0, 1\}, 1 \le j \le k, 1 \le k \le n - 1\})$,

  • $\phi_2 = (\{(i_1, \ldots, i_j = 1, \ldots, i_n) \to ((i_1, \ldots, i_j = 1, \ldots, i_n), out) \mid x_j \text{ is present in } C_1, 1 \le j \le n\} \cup$
    $\{(i_1, \ldots, i_j = 0, \ldots, i_n) \to ((i_1, \ldots, i_j = 0, \ldots, i_n), out) \mid \neg x_j \text{ is present in } C_1, 1 \le j \le n\},$
    $\ldots,$
    $\{(i_1, \ldots, i_j = 1, \ldots, i_n) \to ((i_1, \ldots, i_j = 1, \ldots, i_n), out) \mid x_j \text{ is present in } C_m, 1 \le j \le n\} \cup$
    $\{(i_1, \ldots, i_j = 0, \ldots, i_n) \to ((i_1, \ldots, i_j = 0, \ldots, i_n), out) \mid \neg x_j \text{ is present in } C_m, 1 \le j \le n\},$
    $\emptyset)$;

  where $(i_1, \ldots, i_j = 1, \ldots, i_n)$ and $(i_1, \ldots, i_j = 0, \ldots, i_n)$ denote elements
  of $V$ having the $j$-th component equal to 1 and 0, respectively;

- $F(q, \phi_k) = \{q\}$, $0 \le k \le 2$;

- $I = \{q\}$.

The system $\Pi$ starts from state $q$ with $\emptyset, \ldots, \emptyset, \{(0), (1)\}$. By applying $n$
times $\phi_0$ and $\phi_1$ in parallel one generates all truth values for the $n$ variables in
the form of $2^n$ symbols $(i_1, \ldots, i_n)$ with $i_j = 1$ or $i_j = 0$ indicating that variable
$x_j$ is either true or false. All these combinations are obtained in $n$ steps in state
$q$. In the next $m$ steps $\phi_2$ checks whether or not at least one truth-assignment
satisfies all clauses; this, if it exists, will exit the system. The SAT problem is solved
in this way in $n + m$ steps. □
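The two phases of the proof (parallel generation of assignments, then one filtering step per clause) can be mimicked by a small sequential sketch. This is only an illustration under assumptions: the function name `solve_sat` and the integer encoding of literals (j for x_j, -j for its negation) are choices of the sketch, and its step counter counts the sketch's own rounds rather than reproducing the step count stated in the proof.

```python
def solve_sat(n, clauses):
    """Return (surviving assignments, number of simulated rounds).

    clauses: list of clauses, each a list of non-zero ints;
    literal j stands for x_j and -j for the negation of x_j.
    """
    # Generation phase: every round extends each partial assignment
    # by a 0 and by a 1, as the rule sets for 0 and 1 do in parallel.
    symbols = [(0,), (1,)]
    steps = 0
    while len(symbols[0]) < n:
        symbols = [s + (bit,) for s in symbols for bit in (0, 1)]
        steps += 1

    # Checking phase: one round per clause; a symbol passes a clause
    # if at least one literal of the clause is satisfied by it.
    for clause in clauses:
        symbols = [
            s for s in symbols
            if any(s[abs(lit) - 1] == (1 if lit > 0 else 0) for lit in clause)
        ]
        steps += 1
    return symbols, steps


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
models, rounds = solve_sat(3, [[1, 2], [-1, 3], [-2, -3]])
```

The number of rounds grows linearly with the number of variables and clauses, while the list of symbols grows exponentially, which is the time-for-space trade mentioned at the start of this section.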

There are some important similarities between the above theorem and The- 
orem 5 in [1]: 

- the same membrane structure;

- the first m initial regions empty;

- the true and false values introduced in parallel;

but also relevant distinct features:

- fewer states and a simpler definition of F in the above theorem;

- linear (in n) number of symbols in V and rules, in the case of Theorem 5 [1],
but exponential number of corresponding components, in the above theorem.

5 Conclusions

In this paper two types of Eilenberg P systems, namely EOP systems and EOPP
systems, have been introduced. They combine the control structure of an Eilenberg
machine as a driving mechanism of the computation with a cell-like structure
having a hierarchical organisation of the objects involved in the computational
process. The computational power of EOP systems is investigated with respect to
three parameters: number of membranes, number of states, and number of sets
of rules.

It is proven that the family of Parikh sets of vectors of numbers generated by 
EOP systems with one membrane, one state and one single set of rules coincides 
with the family of Parikh sets of context-free languages. The hierarchy collapses 
when at least one membrane, two states and four sets of rules are used and in 
this case a characterization of the family of Parikh sets of vectors associated 
with ETOL languages is obtained. It is also shown that every EOP system may 
be simulated by an EOPP system. 

EOPP systems represent the parallel counterpart of EOP systems, allowing 
not only the rules inside of the cell-like structure to develop in parallel, but also 
the transitions emerging from the same state. More than this, all states that are 
reached during the computation process as target states, may trigger in the next 
step all transitions emerging from them. It is shown that a general method to 
simulate an EOP system as an EOPP system may be stated. Apart from the
fact that EOPP systems might describe interesting biological phenomena like cell
division and collision, they are also an effective mechanism for solving NP-complete
problems, like SAT, in linear time.

Abstract. We examine hypotheses coming from the physical world and
address new mathematical issues on tiling. We hope to bring to the
attention of mathematicians the way that chemists use tiling in nanotechnology,
where the aim is to propose building blocks and experimental
protocols suitable for the construction of 1D, 2D and 3D macromolecular
assembly. We shall especially concentrate on DNA nanotechnology,
which has been demonstrated in recent years to be the most effective
programmable self-assembly system. Here, the controlled construction
of supramolecular assemblies containing components of fixed sizes and
shapes is the principal objective. We shall spell out the algorithmic
properties and combinatorial constraints of “physical protocols”, to bring the
working hypotheses of chemists closer to a mathematical formulation.



1 Introduction to Molecular Self-assembly 

Molecular self-assembly is the spontaneous organisation of molecules under ther- 
modynamic equilibrium conditions into a structurally well-defined and rather 
stable arrangement through a number of non-covalent interactions [5,26,52]. It 
should not be forgotten that periodic self-assemblies of molecules lead to crystals 
in one, two or three dimensions; we often do not understand the interactions 
between the constituents of a crystal, but their presence in our world was an 
existence-proof for 3D self-assembly long before the notion was voiced. By a 
non-covalent interaction, we mean the formation of several non-covalent weak 
chemical bonds between molecules, including hydrogen bonds, ionic bonds and 
van der Waals interactions. These interactions (of the order of 1-5 kcal/mol) 
can be considered reversible at normal temperatures, while covalent interactions 
(typically > 50 kcal/mol) are regarded as irreversible. 
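A rough numerical reading of these energy scales can be obtained from the Boltzmann factor exp(-E/RT), which compares a bond energy E with the thermal energy RT. Using this factor as a gauge of reversibility is an assumption of this sketch, not a claim made in the text; the function name and the default temperature of 298 K are likewise illustrative.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def boltzmann_factor(energy_kcal_per_mol, temperature_k=298.0):
    """Relative weight exp(-E/RT) of a thermal fluctuation of energy E."""
    return math.exp(-energy_kcal_per_mol / (R_KCAL * temperature_k))


weak_low = boltzmann_factor(1.0)    # lower end of the non-covalent range
weak_high = boltzmann_factor(5.0)   # upper end of the non-covalent range
covalent = boltzmann_factor(50.0)   # typical covalent scale
```

At room temperature RT is about 0.6 kcal/mol, so the factor for the weak interactions stays within a few orders of magnitude of 1, while for 50 kcal/mol it is vanishingly small.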

The self-association process leads the molecules to form stable hierarchical 
macroscopic structures. Even if the bonds themselves are rather weak, their 
collective interaction often results in very stable assemblies; think, for example, 
of an ice cube, held together by hydrogen bonds. Two important elements of 
molecular self-assembly are complementarity and self-stability, where both the
size and the correct orientation of the molecules are crucial in order to have a 
complementary and compatible fitting. 



N. Jonoska et al. (Eds.): Molecular Computing (Head Festschrift), LNCS 2950, pp. 61—83, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 







Alessandra Carbone and Nadrian C. Seeman 



The key engineering principle for molecular self-assembly is to design molec- 
ular building blocks that are able to undergo spontaneous stepwise interactions 
so that they self-assemble via weak bonding. This design is a type of “chemical 
programming”, where the instructions are incorporated into the covalent structural
framework of each molecular component, and where the running of the
algorithm is based on the specific interaction patterns taking place among the 
molecules, their environment, and the intermediate stages of the assembly. The 
aim of the game is to induce and direct a controlled process. 

Molecular self-assembly design is an art and to select from the vast virtual 
combinatorial library of alternatives is far from being an automatic task [19]. 
There are principles though, that could be mathematically analyzed and one 
of the purposes of this paper is to lead the reader towards such possibilities. 
We shall talk mainly about self-assembly from branched DNA-molecules, which 
in the last few years have produced considerable advances in the suggestion 
of potential biological materials for a wide range of applications [39]. Other 
directions using peptides and phospholipids have been also pursued successfully 
[57,35,4]. 

We shall start with an abstract overview of some of the principles governing 
self-assembly which have been investigated by chemists (for an introduction see 
also [27]), with a special emphasis on DNA self-assembly. With the desire to for- 
malise in an appropriate mathematical language such principles and develop a 
combinatorial theory of self-assembly, we try to suggest mathematical structures 
that arise naturally from physical examples. All through the paper, we support 
our formalistic choices with experimental observations. A number of combinato- 
rial and algorithmic problems are proposed. The word “tile” is used throughout 
the paper in a broad sense, as a synonym of “molecule” or of “combinatorial 
building block” leading to some assembly. 

2 Examples of Molecular Self-assembly and Scales 

Self-assembled entities may be either discrete constructions, or extended assem- 
blies, potentially infinite, and in practice may reach very large sizes. These assem- 
blies include such species as 1-dimensional polymolecular chains and fibers, or 
2-dimensional layers and membranes, or 3-dimensional solids. Due to the excep- 
tionally complicated cellular environment, the interplay of the different ligand 
affinities and the inherent complexity of the building blocks, it is not easy to 
predict, control and re-program cellular components. Proteins can in principle 
be engineered but to predict protein conformation is far from our grasp nowa- 
days. At the other extreme lie chemical assemblies, such as organic or inorganic 
crystals, which are constituted by much simpler structural components that are 
not easily programmed. Within this spectrum of assembly possibilities, DNA 
self-assembly has revealed itself as the most tractable example of programmable 
molecular assembly, due to the high specificity of intermolecular Watson-Crick 
base-pairing, combined with the known structure formed by the components 
when they associate [31]. This has been demonstrated in recent years both
theoretically and experimentally, as we shall discuss later.

3 Molecular Self-assembly Processes

There are three basic steps that define a process of molecular self-assembly: 

1. molecular recognition: elementary molecules selectively bind to others; 

2. growth: elementary molecules or intermediate assemblies are the building 
blocks that bind to each other following a sequential or hierarchical assembly; 
cooperativity and non-linear behavior often characterize this process; 

3. termination: a built-in halting feature is required to specify the completion 
of the assembly. Without it, assemblies can potentially grow infinitely; in 
practice, their growth is interrupted by physical and/or environmental con- 
straints. 

Molecular self-assembly is a time-dependent process and because of this,
temporal information and kinetic control may play a role in the process, before
thermodynamic stability is reached. For example, in a recent algorithmic self- 
assembly simulating a circuit constituted by a sequence of XOR gates [30], a 
template describing the input for the circuit, assembled first from DNA tiles 
as the temperature was lowered, because these tiles were programmed to have 
stronger interactions; the individual tiles that performed the gating functions, 
i.e. the actual computation of each XOR gate, assembled on the template later 
(at a lower temperature), because they interacted more weakly. If, as in this 
example, the kinetic product is an intermediate located on the pathway towards 
the final product, such a process is sequential. If not, then the process is said to 
bifurcate. 
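The logical content of such an assembly can be sketched as follows. The text only says that the circuit is a sequence of XOR gates; feeding each gate's output into the next gate (a cumulative XOR) is an assumption of this sketch, and the names `cumulative_xor` and `seed` are hypothetical.

```python
def cumulative_xor(inputs, seed=0):
    """Chain XOR gates: each output is the previous output XOR the next input bit."""
    outputs, y = [], seed
    for x in inputs:
        y ^= x
        outputs.append(y)
    return outputs


# The input bits play the role of the template that assembles first;
# the gate outputs are then filled in one position after another.
result = cumulative_xor([1, 0, 1, 1])
```

In this reading the template fixes the inputs, and each output can only be determined once the previous one is in place, which is what makes the process sequential.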

Molecular self-assembly is also a highly parallel process, where many copies 
of different molecules bind simultaneously to form intermediate complexes. One 
might be seeking to construct many copies of the same complex at the same time, 
as in the assembly of periodic 1D or 2D arrays; alternatively, one might wish to
assemble in parallel different molecules, as in DNA-based computation, where 
different assemblies are sought to test out the combinatorics of the problem 
[1,22]. A sequential (or deterministic) process is defined as a sequence of highly 
parallel instruction steps. 

Programming a system that induces strictly sequential assembly might be 
achieved, depending on the sensitivity of the program to perturbations. In a ro- 
bust system, the instructions (that is the coding of the molecular interactions) 
are strong enough to ensure the stability of the process against interfering inter- 
actions or against the modification of parameters. Sensitivity to perturbations 
limits the operational range, but on the other hand, it ensures control on the 
assembly. 

An example of strong instructions is the “perfect” pairing of strands of dif- 
ferent length in the assembly of DNA-tiles due to Watson-Crick interacting se- 
quences. The drawback in sequential assembly of DNA-tiles is due to the complex 
combinatorics of sequences which are needed to construct objects with discrete 
asymmetric shapes, or aperiodic assemblies. The search for multiple sequences
which pair in a controlled way and avoid unwanted interactions is far from being
an obvious task. Alternative approaches concern, besides tile design, self-assembly
algorithms and protocols (Section 7).

A sequential process might either be commutative, if the order of the assem- 
bly steps can be interchanged along the pathway leading to the final assembly, 
or it might be non-commutative, if the intermediates need to interact in a fixed 
consecutive manner. DNA-based computations, such as the assembly of graphs 
[34] are commutative: a series of branched junctions can come together in any 
order to yield the final product (as discussed in Section 6 for 3-color graphs). An
example of a non-commutative process is the construction of DNA tiles along the 
assembly of a periodic 2D array: single stranded DNA sequences are put in a pot 
at once, and since the tiles melt at a temperature higher than the intermolecu- 
lar interactions, tiles are “prepared” first, before the 2D assembly takes place. 
Even if indirectly, these physical conditions imply non-commutativity. Later on, 
the 2D lattice can assemble with gaps that can later be filled in from the 3rd 
direction. Commutativity, in this latter step, may create irregularities when 3D 
arrays are considered instead, since gaps might get sealed in as a defect. Any hi- 
erarchical construction, such as solid-support-based DNA object synthesis [58] is 
non-commutative. Another example of a non-commutative assembly is a frame- 
based construction [32], wherein an assembly is templated by a “frame” that 
surrounds it: tiles assemble within the boundaries of the frame and they are 
guided by the code of the tiles forming the frame. It is non-commutative, in that 
the frame has to be available first. 




Fig. 1. Protocol for the Synthesis of a Quadrilateral. The intermolecular addi- 
tions of corners is repetitive, but a different route leads to intramolecular closure. 



Another characteristic of a molecular self-assembly is that the hierarchical 
build-up of complex assemblies, allows one to intervene at each step, either to 
suppress the following one, or to orient the system towards a different pathway. 
For example, the construction of a square from identical units using the solid- 
support method entailed the same procedures to produce an object with 2, 3, or 
4 corners. Once the fourth corner was in place, a different pathway was taken 
to close the square [58], as shown in Figure 1. A pentagon, hexagon or higher
polygon could have been made by the same procedure, just by choosing to close
it after adding more units.

Fig. 2. Triplet junctions GPV, JRV, JGS and PSR can combine in different
configurations. The two smallest ones are a tetrahedron and a cube.

Instructions might be strong but still allow for different objects to appear. 
The same set of tiles might assemble into objects with different geometrical 
shapes and different sizes, that satisfy the same designed combinatorial coding. 
For instance, consider chemical “three-arm junction nodes” (name them GPV, 
JRV, JGS, PSR) accepting 6 kinds of “rods”, called G, P, V, J, S and R. Sev- 
eral geometrical shapes can be generated from these junctions and rods, in such 
a way that all junctions in a node are occupied by some rod. Two such shapes 
are illustrated in Figure 2. In general, there is no way to prevent a given set of 
strands from forming dimers, trimers, etc. Dimers are bigger than monomers, 
trimers bigger than dimers, and so on, and size is an easy property for which to 
screen. However, as a practical matter, entropy will favor the species with the 
smallest number of components, such as the tetrahedral graph in Figure 2; it can 
be selected by decreasing the concentration of the solution. If, under convenient 
conditions, a variety of products results from a non-covalent self-assembly, it is 
possible to obtain only one of them by converting the non-covalent self-assembly 
to a covalent process (e.g., [16]). Selecting for specific shapes having the same 
number of monomers though, might be difficult. It is a combinatorial question 
to design a coding for a set of tiles of fixed shape, that gives rise to an easily 
screenable solution set. 
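The combinatorial coding of the junction example can be checked mechanically. The sketch below assumes that a junction name simply lists the three rod kinds it accepts, so that a rod of kind G can join any two junctions whose names contain G; the function name `rod_graph` is hypothetical.

```python
from itertools import combinations

JUNCTIONS = ["GPV", "JRV", "JGS", "PSR"]
RODS = "GPVJSR"


def rod_graph(junctions, rods):
    """Map each rod kind to the tuple of junction kinds that accept it."""
    return {rod: tuple(j for j in junctions if rod in j) for rod in rods}


edges = rod_graph(JUNCTIONS, RODS)

# Every rod kind is accepted by exactly two junction kinds ...
each_rod_joins_two = all(len(ends) == 2 for ends in edges.values())

# ... and the six rod kinds cover all six pairs of junction kinds,
# i.e. the junction kinds form the complete graph on four vertices.
is_complete = (
    {frozenset(ends) for ends in edges.values()}
    == {frozenset(pair) for pair in combinations(JUNCTIONS, 2)}
)
```

A complete graph on four vertices has four vertices and six edges, like a tetrahedron, so one copy of each junction and each rod is enough for the smallest shape. Larger shapes reuse the same junction and rod kinds several times, which is why the coding alone does not fix the shape.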

4 Molecular Tiling: A Mathematical Formulation 

Attempts to describe molecular assembly, and in particular DNA self-assembly, 
in mathematical terms have been made in [2,6]. Here, we discuss some algo- 
rithmic and combinatorial aspects of self-assembly keeping in mind the physics 
behind the process. 

Tiles and self-assembly. Consider a connected subset $T$ (tile) in $\mathbb{R}^3$, for example
a convex polyhedron, with a distinguishable subset of mutually complementary
(possibly overlapping) non-empty domains on the boundary, denoted $D_b, D'_b \subset \partial T$,
where $b$ runs over a (possibly infinite) set $B$. We are interested in assemblies
generated by $T$, that are subsets $A$ in the Euclidean space, decomposed into a
union of congruent copies of $T$, where two copies may intersect only at their
boundaries and have a “tendency” to meet across complementary domains on
the boundary. It is important to recognize that in the case of DNA, there are
many forms of complementarity, as a function of motif structure [41]. Figure 3
illustrates a DNA strand (named 1) complementary to a variety of other DNA
strands; more complex types of complementarity exist, such as complementarity
in the PX sense [60,46] or in the anti-junction sense [12,60].

Fig. 3. A Variety of Complements to a Single Strand. Panel (a) illustrates a
conventional Watson-Crick duplex, where strand 2 complements strand 1. Panels
(b-e) illustrate a variety of complements to strand 1.

We want to consider a biological macromolecule $T$ (e.g., a protein or a nucleic
acid motif), with complementary binding sites $D_b, D'_b$ such that different copies
of $T$ bind along complementary domains and self-assemble into complexes. In
the geometric context we specify the binding properties by introducing (binding)
isometries $b : \mathbb{R}^3 \to \mathbb{R}^3$ for each $b \in B$ such that $T$ and $b(T)$ intersect only at
the boundary, and $b(D_b) = D'_b$. From now on $B$ is understood as a subset in the
Euclidean isometry group $Iso(\mathbb{R}^3)$.

Accordingly, we define an assembly $A$ associated with $(T, B)$ by the following
data:

1. a connected graph $G = G_A$ with the vertex set $1 \ldots N$,

2. subsets $T_i$ in $\mathbb{R}^3$, where $i = 1 \ldots N$, which may mutually intersect only at
their boundaries,

3. an isometry $b_{k,l} : \mathbb{R}^3 \to \mathbb{R}^3$ moving $T_k$ onto $T_l$, for each edge $(k, l)$ in $G$, such
that there exists an isometry $a_{k,l}$ which moves $T_k$ to $T$ and conjugates $b_{k,l}$
to some binding isometry $b$ in $B$. Notice that this $b$ is uniquely determined
by $b_{k,l}$ up to conjugation.

Given a graph $G_A$ and a tile $T$, the assemblies described by $G_A$ and $T$
might not be unique. The assembly is unambiguously described by the isometries
associated to the edges of $G_A$ (i.e. condition (3) above). See Figure 4 for an
example.

Fig. 4. Four copies of the same tile are arranged in two different assemblies that
correspond to the same graph $G_A$. The labels a,a,b,b correspond to codes for
edges.



Several tiles. If we start with several different tiles T^, , T" rather than with 
a single T, we consider the sets of pairs of binding isometries C Iso(]R^) x 
/so(M^) such that and b'’^^ (T^) intersect only at their boundaries and 

their intersection is non-empty. The definition of an assembly associated to 
({T*}, goes as above with the following modifications: the graph G has 

vertices colored by the index set 1 . . . n, the corresponding subsets in are de- 
noted where i = 1 .. .n and k = 1 ... Ni., and finally, we forfeit the isometries 
bkj and for each edge (k'^,1^) we emphasize an isometry of which moves 
to and T/ to b^^{T^). 

In what follows, we refer to the union of tiles defined above as an assembly.

Qualities of an assembly. The tightness of the tiling is one quality that chemists
appreciate. This can be measured by the number of cycles in the graph $G$, or
equivalently by the negative Euler characteristic of the graph.
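The tightness measure can be computed directly from the contact graph. The sketch below uses the standard cycle rank E - V + C (edges minus vertices plus connected components); for a connected graph this equals 1 minus the Euler characteristic V - E, so it differs from the negative Euler characteristic only by the constant 1. The function name and the integer vertex labels are assumptions of the sketch.

```python
def cycle_rank(num_vertices, edges):
    """Number of independent cycles of an undirected graph: E - V + C."""
    parent = list(range(num_vertices))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1
    return len(edges) - num_vertices + components


# Four tiles in a row: no cycle, a loose assembly.
chain = cycle_rank(4, [(0, 1), (1, 2), (2, 3)])

# Four tiles in a 2 x 2 block: one cycle, a tighter assembly.
block = cycle_rank(4, [(0, 1), (1, 3), (3, 2), (2, 0)])
```

The same four tiles thus score higher when they are packed so that more pairs are in contact.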

The imperfection of a tiling is measured by the “unused” areas of the boundaries
of the tiles. First define the active domain $\partial_{act}(T) \subset \partial T$ as the union of
the intersections of $\partial T$ with $b(T)$ for all $b \in B$. Then define the “unused boundary”
$\partial_{un}(A = \cup_i T_i)$ as the union $\cup_i \partial_{act}(T_i)$ minus the union of the pairwise
intersections $\cup_{(k,l) \in G} T_k \cap T_l$. An assembly is called perfect if the area of the
imperfection equals zero. We say that an assembly contained in a given subset
$X \subset \mathbb{R}^3$ is perfect with respect to $\partial X$, if $\partial_{un}(A) \subset \partial X$.

The uniqueness refers to the uniqueness of an assembly subject to some
additional constraints. For example, given an $X \subset \mathbb{R}^3$, one asks first if $X$ can
be tiled by $(T, B)$ and then asks for the uniqueness of such a tiling. We say that
$(T, B)$ generates an unconditionally unique assembly if every imperfect assembly
uniquely extends to a perfect assembly.

The essential problem of tiling engineering is designing a relatively simple tile
or a few such tiles which assemble with high quality into large and complicated
subsets in $\mathbb{R}^3$. Here is a specific example for the unit sphere $S^2$ rather than $\mathbb{R}^3$,
where one uses the obvious extension of the notion of tilings to homogeneous
spaces. Given $\varepsilon, \delta > 0$, consider triangulations of the sphere into triangles $\Delta$ with
$Diam(\Delta) \le \varepsilon$ and $area(\Delta) \ge \delta\, Diam^2(\Delta)$. It is easy to see that the number of
mutually non-congruent triangles in such a triangulation, call it $n(\varepsilon, \delta)$, goes to
$\infty$ for $\varepsilon \to 0$ and every fixed $\delta > 0$. The problem is to evaluate the asymptotic
behavior of $n(\varepsilon, \delta)$ for $\varepsilon \to 0$ and for either a fixed $\delta$ or $\delta \to 0$.

Complementarity of the domains. Two tiles $T_1$ and $T_2$ have complementary sites,
$D_1$ and $D_2$, when they can bind along their boundaries to each other forming a
connected subset of $R^3$. In physical terms, the two overlapping parts can have
complementary geometrical shape (e.g. think of the concave surface of a protein
and of the convex surface of a ligand binding to it, much as a classical “lock
and key”), but might also correspond to Watson-Crick complementary sequences
(e.g. 5'-ATTCGA-3' and 3'-TAAGCT-5', where A is complementary to
T and C to G as discussed before; see Figure 3).
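
The sequence example above can be checked mechanically. The following is a minimal sketch, assuming strands are written as plain strings over A, C, G, T and that the second strand is written antiparallel (3' to 5') as in the text; the function names are illustrative and not part of the original paper.

```python
# Watson-Crick pairing rule from the text: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement, read in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def are_complementary(five_to_three: str, three_to_five: str) -> bool:
    """True if the two strands, written antiparallel as in the text, pair fully."""
    return (len(five_to_three) == len(three_to_five)
            and complement(five_to_three) == three_to_five)


print(complement("ATTCGA"))                   # TAAGCT
print(are_complementary("ATTCGA", "TAAGCT"))  # True
```

The check is purely combinatorial; it ignores the geometric (lock and key) form of complementarity mentioned in the same paragraph.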




Fig. 5. Left: Rodlike tiles differing in length form an assembly that grows until 
the ends exactly match. Right: polymeric structure growing until the energy 
required to fit new subunits becomes too large. 




Fig. 6. A tile is stable in the assembly only if it binds at two adjacent binding
sites. The stability of the whole assembly is ensured by the enforced stability of
the template. The formal description of this example is not completely captured
by our model.



Fig. 7. Tiles which differ in shape and binding sites. Their binding generates a
new contiguous binding site.



Real life examples. It remains unclear, in general, how cells control the size of
(imperfect, with some unused boundary) assemblies, but certain mechanisms
are understood. For example, out of two rod-like molecules of length three and
five, one gets a double rod of length 15 as illustrated in Figure 5 (left). Another
strategy is starting an assembly from a given template (see Figure 6 for a specific
design). Sometimes, tiling is non-isometric: tiles distort slightly in order to fit,
and the assembly terminates when the bending energy becomes too costly or
when the accumulated distortion deforms and deactivates the binding sites (see
Figure 5 (right)). Also, the binding of a ligand to an active site might change
the shape of the molecule and thus influence the binding activity of other sites.
Another possibility is the creation of a new binding site distributed over two or
more tiles bound together at an earlier stage of the assembly (see Figure 7). These
mechanisms may produce non-trivial dynamics in the space of assemblies in
the presence of free energy. In particular, one may try to design a system which
induces a periodic motion of a tile over a template, something in the spirit of
RNA-polymerase cycling around a circle of DNA [11].
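
The rod example can be read as a least-common-multiple computation: two parallel rows of rods keep growing until their ends coincide. The sketch below assumes this reading of Figure 5 (left) and integer rod lengths; the function names are hypothetical.

```python
from math import gcd


def matched_length(len_a: int, len_b: int) -> int:
    """Length at which rows of rods of the two lengths first end together (the lcm)."""
    return len_a * len_b // gcd(len_a, len_b)


def copies_needed(len_a: int, len_b: int) -> tuple:
    """Number of rods of each kind in the finished double rod."""
    total = matched_length(len_a, len_b)
    return total // len_a, total // len_b


print(matched_length(3, 5))  # 15
print(copies_needed(3, 5))   # (5, 3)
```

For rods of lengths three and five this gives the double rod of length 15 quoted in the text, built from five short rods and three long ones.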



Fig. 8. Key Motifs in Structural DNA Nanotechnology. On the left is a Holliday
junction (HJ), a 4-arm junction that results from a single reciprocal exchange
between double helices. To its right is a double crossover (DX) molecule,
resulting from a double exchange. To the right of the DX is a triple crossover
(TX) molecule, that results from two successive double reciprocal exchanges.
The HJ, the DX and the TX molecules all contain exchanges between strands
of opposite polarity. To the right of the TX molecule are a pair of DNA
parallelograms, DNA-P-N [29], constructed from normal DNA, and DNA-P-B [43],
constructed from Bowtie junctions, containing 5', 5' and 3', 3' linkages in their
crossover strands.



DNA tiles and tensegrity. Molecules of nanometer size, made out of DNA
strands, have been proposed in a variety of different shapes. See Figure 8 for
a representative collection of shapes. Algorithms have been developed
successfully to produce sequences that self-assemble into these and other motifs [36].
Branched molecules are tiles constituted by several single strands which self-assemble along
their coding sequences in a “star-like” configuration, where a tip of the star is a
branch [36,51,38] (Figure 3a,c,e illustrate 2, 3 and 4-arm branched molecules).
Theoretically, one might think to construct k-armed branched molecules, for
any k > 2, where each strand is paired with two other strands to form a pair of
double-helical arms; in practice, molecules with 6 arms have been reported, but
larger ones are under construction. The angles between the arms are known to be
flexible in most cases. If one adds sticky ends to a branched molecule, i.e. single
stranded extensions of a double helix, a cluster is created that contains
specifically addressable ends [36]. This idea is illustrated in Figure 9, where a 4-arm
branched junction with two complementary pairs of sticky ends self-assembles
to produce a quadrilateral.

Fig. 9. Formation of a 2-dimensional lattice (right) from a 4-arm branched
junction (left). X is a sticky end and X' is its complement. The same relationship
holds for Y and Y'. X and Y are different from each other.



The search for motifs based on DNA branched junctions that behave as 
though they are “rigid” while in the test tube, led to the design of several DNA- 
molecules, and some are illustrated in Figure 8. Rigid shapes impose strong 
limitations on the design of suitable molecular tiles; roughly speaking, a rigid, 
or tense, object is a 3-dimensional solid that does not undergo deformations: we 
ask that if its 1-dimensional faces do not undergo deformation, then no deforma- 
tion exists. For a tetrahedron or any convex deltahedron, it is easy to see that 
no change of the angles between edges (edges are 1-dimensional faces for the 
tetrahedron) can take place without the edges being deformed. On the other hand,
a cube is an example of a non-tense object since we can fix the edges (1-faces) of 
the cube not to undergo any deformation and still be able to deform the angles 
between them. 

Geometry of the boundaries: smooth deformations of tiles. It might be appropriate
to consider assemblies which are affected by an $\epsilon$-deformation in the shape
of the tiles after binding. More precisely, a tile $T \subset R^3$ is mapped into $R^3$ by some
$\epsilon$-deformation as follows: there is an embedding $e : T \subset R^3 \to T' \subset R^3$ such
that for all points $x \in T$ there is a point $y \in T'$ such that the Euclidean distance
$d(x, y) < \epsilon$. The definitions of isometry and binding site given at the beginning
of Section 4 need to be adjusted accordingly into new notions of $\epsilon$-isometry and
$\epsilon$-binding site, which intuitively correspond to the original notions up to some
$\epsilon$-variation. One needs to establish whether an $\epsilon$-deformation affects a binding
site or not, and give thresholds on the amount of deformation which is accepted
to affect non-empty domains of the boundary.
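
The closeness condition in this definition can be tested on sampled points. The sketch below assumes each tile is approximated by a finite list of 3D sample points (an assumption made only for illustration; the text works with continuous subsets), and it checks only the distance condition, not that the map is an embedding.

```python
from math import dist


def within_epsilon(tile, deformed, eps):
    """True if every sample point of `tile` has a point of `deformed` closer than eps."""
    return all(any(dist(x, y) < eps for y in deformed) for x in tile)


# A hypothetical triangular tile and a copy shifted by 0.05 along the z axis.
tile = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x, y, z + 0.05) for (x, y, z) in tile]

print(within_epsilon(tile, shifted, 0.1))   # True
print(within_epsilon(tile, shifted, 0.01))  # False
```

A threshold on acceptable deformation, as asked for in the text, would correspond to choosing the largest `eps` for which binding sites are still considered intact.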

The growth of the assembly affected by $\epsilon$-deformation calls for the estimation
of bounds on the size of the construction. The instability of the system comes from
the narrow range of conditions under which the assembly takes place. The formation of
singularities and of bifurcation points between different assemblies might lead to
the disruption of the assembly, but might also lead to variety in the complexity.

Physical considerations on the shape of tiles. In addition to the need to observe
appropriate solution conditions that encourage self-assembly, it is important to
realize that there are physical constraints on the assembly of real tiles that do
not affect virtual tiles. For example, the helicity of random-sequence DNA is
approximately 10.5 nucleotide pairs per turn in solution. This value makes it easy to make
TX molecules (Figure 8) whose three helix axes form an angle of 120°, but 90° is
much harder, unless one is able (perhaps through sequence variation) to change
the repeat to 10.4 nucleotide pairs per turn [45].

In a similar vein, a likely form of shaped 3D arrays will entail polyhedra whose 
edges contain DX molecules (Figure 8) [40]. It might appear that a tetrahedron 
would be a good polyhedron to use as the basis of such a 3D tile. However, 
although the edges of a tetrahedron obviously span 3-space, there is no group of 
three edges to which one can attach a single extra helix (i.e. to make those edges 
DX molecules instead of single DNA helices, with the extra helices outside the 
helices defining the tetrahedron) to produce the needed vectors: their diameters 
would cause them to clash stereochemically when extended beyond the bound- 
aries of the tetrahedron. Notice that extra-hedral domains on adjacent edges 
inherently clash, and there is no group of three edges in a tetrahedron that does 
not include an adjacent pair. 



5 An Abstract Model to Describe the Dynamics of 
Self-assembly 

A formal description of the dynamics of a self-assembly on a space S, where S can
be either $R$, $R^2$, $R^3$ or any discrete approximation of those, can be formulated by
a simple iterative process as follows. Consider n tiles $T_1, \ldots, T_n$, and take a finite
number of copies of each $T_i$, for all $i = 1, \ldots, n$. At stage 1, randomly assign to
each physical tile a specific position in S in such a way that no two tiles overlap
and that only tiles lying side-by-side and having complementary boundaries stick
together. The set of complexes containing more than one tile, with their positions
in S, defines a configuration on S; single tiles are removed from S and used to
re-iterate the random assignment at the next stage: the configuration of tiles lying
in S which one reaches at stage i is filled up further by new non-overlapping
complementary adjacent tiles at stage i + 1. The process is repeated until all
tiles are used or until a sufficiently large connected area in S is filled (e.g. area
> N, for some large N).
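
The iterative process can be sketched on a discrete approximation of $S = R$, i.e. a row of cells. Everything specific here is an assumption made for illustration: a tile is a pair (left site, right site), a site named x binds only to the site named x', and the names `run_stage` and `simulate` are hypothetical. It is a toy model of the stages described above, not the authors' implementation.

```python
import random


def complementary(site_a, site_b):
    """Hypothetical coding: a site named x binds only to the site named x'."""
    return site_a + "'" == site_b or site_b + "'" == site_a


def bound(grid, k):
    """True if the tiles in cells k and k + 1 lie side by side and stick together."""
    if k < 0 or k + 1 >= len(grid):
        return False
    left, right = grid[k], grid[k + 1]
    return left is not None and right is not None and complementary(left[1], right[0])


def run_stage(grid, pool, rng):
    """One stage: drop free tiles on random empty cells, then remove isolated tiles."""
    empty = [k for k, tile in enumerate(grid) if tile is None]
    rng.shuffle(empty)
    rng.shuffle(pool)
    while pool and empty:
        grid[empty.pop()] = pool.pop()
    isolated = [k for k, tile in enumerate(grid)
                if tile is not None and not bound(grid, k - 1) and not bound(grid, k)]
    for k in isolated:          # single tiles go back to the pool for the next stage
        pool.append(grid[k])
        grid[k] = None


def largest_complex(grid):
    """Size of the longest run of consecutively bound tiles."""
    best = run = 0
    for k, tile in enumerate(grid):
        if tile is None:
            run = 0
        else:
            run = run + 1 if bound(grid, k - 1) else 1
        best = max(best, run)
    return best


def simulate(tiles, cells=30, target=6, max_stages=200, seed=0):
    """Iterate stages until the pool is empty, a complex reaches `target`, or stages run out."""
    rng = random.Random(seed)
    grid, pool = [None] * cells, list(tiles)
    stages = 0
    while pool and stages < max_stages and largest_complex(grid) < target:
        run_stage(grid, pool, rng)
        stages += 1
    return grid, pool, stages


# Two tile types whose sites alternate: (a, b') binds on its right to (b, a'), and so on.
tiles = [("a", "b'"), ("b", "a'")] * 10
grid, pool, stages = simulate(tiles, seed=1)
print(stages, largest_complex(grid), len(pool))
```

The stopping rule mirrors the text: the loop ends when all tiles are used or when a connected region of at least `target` cells has formed, with `max_stages` added only so that the sketch always terminates.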

Different outputs might result from this random process: they go from very 
tight assemblies, to assemblies with several unfilled regions, to disconnected sur- 
faces, and so on. The resulting configurations and the time for reaching a con- 
figuration strongly depend on the coding hypothesis, e.g. whether new binding 
sites can appear or not by the combination of several tiles, whether “holes” can 
be filled or not, how many different competing boundary sites are in the sys- 
tem, how many tiles are in the system, whether connected regions can undergo 
translations in S while the process takes place, whether connected tiles might 
become disconnected, etc. 

The process could start from a specific configuration of S instead of using the 
first iteration step to set a random configuration. Such an initial configuration, 
if connected, would play the role of a template for the random process described 
by the iteration steps. Figure 6 illustrates an example of templating, where a 
1-dimensional array of n molecules disposed in a row is expected to play the 
role of a template and interact with single molecules following a schema coded 
by the boundary of the tiles. Another example is the “nano-frame” proposed in 
[32], a border template constraining the region of the tiling assemblies. 

6 Closed Assemblies and Covering Graphs 

Closed tiling systems are assemblies whose binding sites have all been used. More
formally, this amounts to saying that the graph $G_A$ underlying the assembly A
is such that all tiles $T_i$ corresponding to its vertices, where $i = 1, \ldots, N$, intersect
on all their boundary sites. This means also that the degree of connection
of each node i of $G_A$ corresponds to the number of available interaction sites
of $T_i$, and that each edge (i, l) departing from i corresponds to an isometry $b_{i,l}$
moving $T_i$ onto $T_l$ which fixes some binding site. Many graphs might be locally
embeddable in $G_A$, and we call them covering graphs of $G_A$: a graph G is a
covering graph of $G_A$ if there is a map $p : G \to G_A$ such that

1. p is surjective, i.e. for all nodes $y \in G_A$ there is a node $x \in G$ such that
$p(x) = y$,

2. if $x \to y$ in G then $p(x) \to p(y)$ in $G_A$,

3. $degree(x) = degree(p(x))$, for all nodes x in G,

4. $b_{x,y} = b_{p(x),p(y)}$, for each edge $x \to y \in G$,

5. $\{b_{x,y} : x \to y \in G\} = \{b_{p(x),z} : p(x) \to z \in G_A\}$, for each node $x \in G$.

Condition (4), saying that the binding site between $T_x$ and $T_y$ is the same
as the binding site between $T_{p(x)}$ and $T_{p(y)}$, and condition (5), saying that the
binding sites of $T_x$ are the same as the binding sites of $T_{p(x)}$, ensure that G
and $G_A$ underlie tiling systems for the same set of tiles. The graph on the left
hand side of Figure 10 does not satisfy (5) and provides a counterexample. If $G_A$
represents a closed tiling system for a set of tiles S, then each covering graph of
$G_A$ represents a closed tiling system for S also.
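
The covering conditions can be checked directly on small graphs. The sketch below is written under a simplifying assumption that is not in the original: the binding site on an edge is determined by the pair of tile types it joins, so condition (4) holds automatically and condition (5) reduces to "the images of the neighbours of x are exactly the neighbours of p(x)". Graphs are undirected adjacency dictionaries, and the two example graphs are made up for illustration (they are not the graphs of Figure 10).

```python
def is_covering(G, GA, p):
    """Check conditions (1)-(3) and (5) for a map p from the nodes of G to those of GA."""
    # (1) p is surjective.
    if {p[x] for x in G} != set(GA):
        return False
    for x, neighbours in G.items():
        images = [p[y] for y in neighbours]
        # (2) every edge of G is sent to an edge of GA.
        if any(image not in GA[p[x]] for image in images):
            return False
        # (3) degrees agree.
        if len(neighbours) != len(GA[p[x]]):
            return False
        # (5) x uses the same binding sites as p(x): one neighbour of each required type.
        if set(images) != set(GA[p[x]]):
            return False
    return True


def by_letter(G):
    """Hypothetical labelling convention: node 'a1' is mapped to 'a'."""
    return {x: x[0] for x in G}


# A triangle, and a hexagon that covers it twice.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
hexagon = {"a1": {"b1", "c2"}, "b1": {"a1", "c1"}, "c1": {"b1", "a2"},
           "a2": {"c1", "b2"}, "b2": {"a2", "c2"}, "c2": {"b2", "a1"}}

# A 4-cycle f-c-g-d, and an 8-cycle in which f1 meets two copies of c.
square = {"f": {"c", "d"}, "c": {"f", "g"}, "g": {"c", "d"}, "d": {"f", "g"}}
octagon = {"f1": {"c1", "c2"}, "f2": {"d1", "d2"},
           "c1": {"f1", "g1"}, "c2": {"f1", "g2"},
           "d1": {"f2", "g1"}, "d2": {"f2", "g2"},
           "g1": {"c1", "d1"}, "g2": {"c2", "d2"}}

print(is_covering(hexagon, triangle, by_letter(hexagon)))  # True
print(is_covering(octagon, square, by_letter(octagon)))    # False
```

The second example satisfies conditions (1)-(3) and fails only at (5), since the node `f1` would need two identical binding sites for `c`; this is the kind of failure the text attributes to Figure 10.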

Fig. 10. Given a set of six distinct tiles whose binding sites are specific to each
pair of tile interaction described by edges in the graph $G_A$ (left), notice that G
(right) is not a covering graph for $G_A$ since it satisfies conditions (1)-(3) but it
does not satisfy (5) (see text). To see this, consider the mapping p between nodes
of G and $G_A$ which is suggested by the labels of the nodes. We want to think of
f in $G_A$ as representing a tile $T_f$ with two distinct binding sites, one interacting
with $T_c$ and the other with $T_d$. Node $f_1$ is linked to two copies of c and node $f_2$
is linked to two copies of d; this means that $T_{f_1}$ ($T_{f_2}$), having the same binding
sites as $T_f$, should bind to $T_{c_1}, T_{c_2}$ ($T_{d_1}, T_{d_2}$). But this is impossible because the
binding would require the existence of two identical sites in $T_{f_1}$ ($T_{f_2}$).

Given a set of tiles one would like to characterize the family of closed assem- 
blies, or equivalently, of covering graphs, if any. An important application is in 
the solution of combinatorial problems. 

Example 1. [22]. A graph G = (V, E) is said to be 3-colorable if there is a
surjective function $f : V \to \{a, b, c\}$ such that if $v \to w \in E$, then $f(v) \neq f(w)$.
Imagine constructing the graph G with two kinds of molecules, one coding for
the nodes and one for the edges. Node-molecules are branched molecules, where
the number of branches is the degree of the node, and edge-molecules are
two-branched molecules. Each branch of a node-molecule has a sticky end whose
code contains information on the node of the graph that connects to it and on
a color for the node. The n branches of the same node-molecule are assumed to
have the same code. Edge-molecules have two sticky ends and their code
contains information on the origin and target nodes as well as on the colors of such
nodes. The two colors are supposed to be different.

To consider three colors in the physical realization of the graph G, one con- 
structs a node-molecule for each one of the three colors, together with all possible 
combinations of pairs of different colors for edge-molecules. 

By combining several identical copies of these molecules and ligating them,
open and possibly closed assemblies will form. Open assemblies are discarded
(this can be done with the help of exonuclease enzymes that digest molecules with
free ends) and closed assemblies, if any, ensure that the graph is 3-colorable. The
only closed assemblies that can be formed in the test tube are covering graphs.
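
The molecule library of Example 1 can be enumerated in software. The sketch below is only a combinatorial stand-in for the chemistry: node-molecules are modeled as (node, color) pairs, edge-molecules as (origin, color, target, color) tuples with two different colors, and a closed assembly is modeled as a choice of one color per node for which every edge finds a matching edge-molecule. It considers only closed assemblies that are single copies of G (not larger covering graphs) and does not enforce surjectivity of the coloring; all names are hypothetical.

```python
from itertools import permutations, product

COLORS = ("a", "b", "c")


def node_molecules(nodes):
    """One node-molecule per node and per color."""
    return [(v, color) for v in nodes for color in COLORS]


def edge_molecules(edges):
    """One edge-molecule per edge and per ordered pair of different colors."""
    return [(v, cv, w, cw) for (v, w) in edges for cv, cw in permutations(COLORS, 2)]


def closed_assignments(nodes, edges):
    """Yield colorings in which every edge is matched by an available edge-molecule."""
    available = set(edge_molecules(edges))
    for colors in product(COLORS, repeat=len(nodes)):
        f = dict(zip(nodes, colors))
        if all((v, f[v], w, f[w]) in available for v, w in edges):
            yield f


triangle_nodes = ["u", "v", "w"]
triangle_edges = [("u", "v"), ("v", "w"), ("u", "w")]
print(len(node_molecules(triangle_nodes)))   # 9
print(len(edge_molecules(triangle_edges)))   # 18
print(len(list(closed_assignments(triangle_nodes, triangle_edges))))  # 6
```

For the complete graph on four nodes the same search yields no assignment, which corresponds to the absence of closed assemblies for a graph that is not 3-colorable.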

7 A Random Model for Sequential Assembly

The random model introduced in Section 5 needs to be adjusted slightly to 
simulate sequential assembly. Sequentiality presupposes the formation of specific 
intermediates, i.e. complexes, at specific moments along the assembly process. 
This means that one can start from a random configuration in S, let the input 
tiles form complexes at random, remove from S isolated tiles and use them as 
input tiles to re-iterate the process until a sufficiently large number of specific 
intermediates is formed. This will provide one step of the sequential assembly, 
and in the simplest case, this step will be re-iterated to model the next steps of 
the sequential process, until all steps are realized. Different types of tiles might 
be used as input tiles to perform different steps of the sequential process. 

In more complicated cases, the above model might need to integrate new
kinds of steps. It might be that some of the steps of the sequential process
require the intervention of specific enzymes, cleaving or ligating DNA tiles. Such
operations are random and their effect on tiles and complexes can be described
rigorously. Also, one might need to consider that tiles forming a complex at step
i disassemble in step i + 1 because of the interaction with new molecular tiles.
This process is also random and can be formally described.

As mentioned in Section 3, the difficulty in inducing a sequential assembly 
comes from the complex combinatorics needed to realize objects of irregular but 
well-defined shape or aperiodic assemblies. A number of solutions have been 
proposed to overcome these combinatorial difficulties; they concern tile design 
(1)-(2), the algorithm for self-assembly (3)-(4) and the engineering protocol (5):

1. a variety of different forms of cohesion have been proposed, such as sticky 
ended cohesion, where single-stranded overhangs cohere between molecules 
[10]; PX cohesion, where topologically closed molecules cohere in a double- 
stranded interaction [60]; edge-sharing, where the interactions are charac- 
terized by lateral interactions [46]; tecto-RNA, where RNA domains pair 
laterally through loop osculations [21]; 

2. one can allow different forms of coding within the same molecule, which can
involve the Watson-Crick sequences as well as the geometry of the molecule [9];

3. one can use “instructed gluing” elements together with DNA-tiles [7]; the 
idea is to add structural sub-units, as gluing elements between tiles, along 
the self-assembly process; in many cases, the use of such sub-units decreases 
the complexity of the protocol: the number of elementary molecules becomes 
smaller and the assembly algorithm becomes more specific;

4. the use of protecting groups, through which some of the potential interaction 
sites of the molecules are momentarily inhibited, is inherently a sequential 
protocol, applicable both to DNA objects [58] and to fractal assemblies [9]; 

5. the solid-support methodology in DNA nanotechnology [58] is an example of 
sequential assembly; it was used to construct the most complex DNA object 
to date, a truncated octahedron [59]; the step-wise synthesis of a square is 
illustrated in Figure 1; here, enzymes intervene in some of the sequential
steps. 

8 Hierarchical Tiling

A set of tiles $\{T_1, \ldots, T_n\}$ is self-consistent if for each $T_i$ with binding site $D_i$
there is a tile $T_j$ with binding site $D_j$ such that $b(D_i) = D_j$, for some isometry
b. Notice that i need not be different from j. In particular, a single tile T is
self-consistent with itself if it has at least two binding sites which are complementary
to each other.

Let $\{T_1, \ldots, T_n\}$ be basic elementary tiles which assemble in a variety of tile
complexes $S_1, \ldots, S_l$, i.e. finite assemblies $S_i$ with unused binding sites. A set
of tile complexes is self-consistent if for each $S_i$ with binding site $D_i$ there is a
tile complex $S_j$ with binding site $D_j$ such that $b(D_i) = D_j$, for some isometry b
defined on tile complexes. New binding sites $D_i$ generated from the assembly of
a tile complex (as in Figure 7) are allowed.

A hierarchical tiling is an assembly X of tiles $\{T_1, \ldots, T_n\}$ that is obtained
by successive steps of assembly generating intermediary sets of tile complexes
$\mathcal{F}_0, \ldots, \mathcal{F}_m$ such that:

1. $\mathcal{F}_0 = \{T_1, \ldots, T_n\}$;

2. $\mathcal{F}_i = \{S_{i,1}, \ldots, S_{i,l_i}\}$, for $i = 1, \ldots, m$, where each $S_{i,j}$ is a tile complex in $\mathcal{F}_{i-1}$;

3. $\mathcal{F}_i$ is a self-consistent set of tile complexes;

4. X is an assembly of $S_{m,1}, \ldots, S_{m,l_m}$.

The value of m is called the order of the hierarchical tiling. A hierarchical tiling
is non-trivial if for each family $\mathcal{F}_i$ there is at least one tile complex $S_{i,j}$ which is
not in $\mathcal{F}_{i-1}$ already. Notice that not all assemblies can be defined as hierarchical
assemblies of order m, for m > 1.

A dynamical model of hierarchical tiling. It can be defined by a repeated iter- 
ation of the random model for self-assembly presented in Section 5, where the 
tile complexes used as input tiles at step i + 1 are the complexes formed in S at
the end of step i. In general, a hierarchical assembly is not a sequential assem- 
bly. It might happen though, that certain assembly processes are defined by a 
combination of sequential steps during the hierarchical self-assembly. 

Some concrete examples of hierarchical assembly. Suitable selection of structural 
units allows the design of molecular entities undergoing self-organisation into 
well-defined architectures, which subsequently may self-assemble into supramolec- 
ular fibrils and networks. The assembly of “infinite” tubes and spheres has been 
realized many times and in many laboratories with different kinds of molecules. 
A basic approach is to design a rod-like molecule with a hydrophobic end and
a hydrophilic one. Then, one puts the molecules in different media and observes
the formation of spheres, where the hydrophilic side of the molecules lies either
inside or outside the sphere, depending on the properties of the medium.
Alternatively, one might observe the formation of a long tube where, again, on the
surface one finds sides with identical hydrophilic/hydrophobic properties. The
formation of spheres and tubes leads us to ask how these shapes might assemble
among themselves into supramolecular periodic or aperiodic structures. What
other shapes allow for the assembly of 1D, 2D and 3D arrays of such tile
complexes?

Besides spheres, tubes and networks, chemists work on the design of synthetic
molecules which lead to helical architectures of both molecular and supramolecular
nature by hierarchical self-organization, or again to the formation of
mushroom-like shapes and to a consequent assembly of such complexes into 3D arrays
[18] (these arrangements are not regular, in the sense that they are not crystals).
Mimicking nucleic-acid sequences, specific sequences of hydrogen bonding
residues are led to act as structure-inducing codons, and such structural coding
allows for the spontaneous but controlled generation of organized materials, e.g.
[18].




Fig. 11. A variety of two-dimensional arrays that have been formed from DNA 
tiles. Panels (a) and (b) illustrate 2D arrays composed of DX and DX + J
molecules. Panel (c) illustrates patterns obtained from TX molecules. Panel (d) 
illustrates an array made of DNA parallelograms. 



At a different scale, for nanoscale molecules, like DNA, a broader range of
possibilities can be explored since all of the contacts can be forced to be of a
Watson-Crick form, although many other types of interaction are possible (e.g.
[60]). The different shapes of tiles introduced at the end of Section 4 enabled the
assembly of several different kinds of periodic 1D and 2D arrays (see Figure 11).
These hierarchical assemblies have order 2: single strands make the starting set
of tiles, which assemble into specific intermediary molecular tiles (described in
Section 4), and finally these molecular tiles self-assemble into a 2D array.

Periodic assemblies in 3 dimensions are still an open problem. Protocols for 
the assembly have been proposed, but highly ordered programmed 3D arrange- 
ments have not yet been realized to resolutions below 1 nm in the laboratory 
for DNA tiles. Aperiodic arrangements, typically harder to assemble and ana- 
lyze than periodic assemblies, present an even greater challenge, because their 
characterization cannot rely on diffraction analysis in a simple fashion. 

Example 2. Fractal assemblies. [8,9] Fractal constructions are a special case of 
aperiodic assemblies. The algorithm here is simple: from a starting molecular 
shape, which can look like a square or a triangle, and is designed to interact 
with copies of itself, one constructs a molecule with the same shape but a larger 
size, and re-iterates the process to get larger and larger assemblies of the given 
shape. The difficulty lies in the design of a set of basic shapes which can self- 
assemble into new self-similar shapes of larger sizes, and whose binding sites 
are coded by self-similar coding. An appropriate coding is important to ensure 
that tile complexes will self-assemble and that undesired binding is avoided. The 
order of this hierarchical tiling, corresponding to the number of iterations of the 
algorithm, is m, for potentially any value of m. In practice, a chemist would be 
happy with m = 4, 5. 

These examples lead to some questions: within the set of feasible shapes and 
interactions, can we classify potential complexes? Once a complex is formed, can 
it be used as a building block to construct larger 1D, 2D or 3D arrays?

9 Size of the Assembly 

How can the size of an assembly be controlled? 

Rough termination is easy to induce. An obvious way is to limit the number 
of molecules in the solution. Another way is to use protecting groups, i.e. DNA 
molecules, which might be single strands for instance, whose binding sites are 
complementary to the binding sites of the tiles used in the assembly. The idea
is that protecting groups might be added to the solution during the process
of self-assembly to prevent new tiles from binding to available sites.

Exact termination is a consequence of the coding for the termination. If a 
synthesis or an assembly is templated, it is always possible to limit growth, by 
leaving out the constituent that is coded at the terminal position, for instance. 
The algorithmic synthesis of triangular circuits illustrated in Figure 6, provides 
another example where this is done [7]. In general, exact size control of a DNA 
self-assembly is hard to achieve. A few protocols have been presented so far. 

In theory, DNA tiles can be used to “count” by creating boundaries with
programmable sizes for 1D, 2D and possibly 3D periodic assemblies. The idea
is to build periodic arrays of size n × m by repeatedly generating the Boolean
truth table for n entries until m rows of the table have been filled [54,56]. If
this schema can be physically implemented, then self-assembly of precisely-sized
nanoscale arrays will be possible.
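
The target pattern of such a counting assembly is easy to write down in software. The sketch below only generates the n × m array of bits described above, assuming rows are listed in increasing binary order with the most significant bit first; it does not model the tile set of [54,56], and the function name is hypothetical.

```python
def counter_rows(n: int, m: int):
    """Return m rows of n bits, cycling through the truth table for n Boolean entries."""
    rows = []
    for row in range(m):
        value = row % (1 << n)  # the table repeats after 2**n rows
        rows.append([(value >> (n - 1 - bit)) & 1 for bit in range(n)])
    return rows


for row in counter_rows(2, 5):
    print(row)
# [0, 0]
# [0, 1]
# [1, 0]
# [1, 1]
# [0, 0]
```

The array stops after exactly m rows, which is the sense in which the counter fixes the size of the assembly.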

Fractal assemblies [8,9] can be thought of as a way to generate fixed geometrical
shapes of controlled size. Besides the rectangular shapes of n × m arrays,
one would like to have a way to grow assemblies with other shapes such as
triangles, hexagons, etc. Fractal assembly allows us to do so by constructing objects
with fixed sizes that are powers of some value: for instance, for the Sierpinski
fractal, the size of the squares is $3^k$, where k is the dimension.
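
The power-of-three sizes can be seen in a direct construction. The sketch below assumes that the Sierpinski fractal meant here is the standard Sierpinski carpet (square) and builds the pattern of side $3^k$ as a grid of booleans; it illustrates the target shape only, not the molecular assembly of [8,9].

```python
def sierpinski_square(k: int):
    """Return a 3**k x 3**k grid; True marks a filled cell of the Sierpinski carpet."""
    size = 3 ** k

    def filled(r: int, c: int) -> bool:
        # A cell is empty if, at some scale, it falls in the middle third of both axes.
        while r or c:
            if r % 3 == 1 and c % 3 == 1:
                return False
            r //= 3
            c //= 3
        return True

    return [[filled(r, c) for c in range(size)] for r in range(size)]


grid = sierpinski_square(2)
print(len(grid))                        # 9
print(sum(cell for row in grid for cell in row))  # 64
```

Each iteration multiplies the side by 3 and the number of filled cells by 8, so only sizes of the form $3^k$ arise.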

10 Algorithmic Assembly 

The combination of different instructions in a “molecular program” has been 
used to design self-assembly systems which follow specific assembly pathways. 
This idea has its mathematical analogue in the work of Wang [48,49,50], who 
proposed a finite set of tiles mimicking the behavior of any Turing Machine. 

Wang tiles are square tiles in $R^2$ whose binding sites are the four sides of
the square, and whose interaction is possible on binding sites labelled by the
same color. If $T_1, T_2, \ldots, T_n$ are Wang tiles, then one asks that $\{T_1, T_2, \ldots, T_n\}$
be a self-consistent set. Once a set of Wang tiles is given, one asks whether the
plane can be tiled with it, and what the properties of the tiling are, namely whether
the set generates periodic tilings only, or both periodic and non-periodic tilings,
or aperiodic tilings only.
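
The matching rule for Wang tiles can be sketched as a small backtracking search. The assumptions here are illustrative: a tile is a tuple of colors (north, east, south, west), tiles may be reused freely and are not rotated, and the search fills only a finite k × k square, so it says nothing about tilings of the whole plane.

```python
def tile_square(tiles, k):
    """Try to fill a k x k square with Wang tiles; return the grid of tiles or None."""
    grid = [[None] * k for _ in range(k)]

    def fits(tile, r, c):
        north, _east, _south, west = tile
        if c > 0 and grid[r][c - 1][1] != west:   # east side of the left neighbour
            return False
        if r > 0 and grid[r - 1][c][2] != north:  # south side of the upper neighbour
            return False
        return True

    def place(pos):
        if pos == k * k:
            return True
        r, c = divmod(pos, k)
        for tile in tiles:
            if fits(tile, r, c):
                grid[r][c] = tile
                if place(pos + 1):
                    return True
        grid[r][c] = None  # undo before backtracking
        return False

    return grid if place(0) else None


# One tile whose opposite sides carry the same color tiles any square.
print(tile_square([("r", "g", "r", "g")], 3) is not None)  # True
# One tile whose opposite sides all differ cannot sit next to a copy of itself.
print(tile_square([("r", "g", "b", "y")], 2))              # None
```

In the molecular realization discussed next, the colors would correspond to Watson-Crick complementary sticky ends rather than identical labels.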

The molecular realization of Wang tiles (where a square becomes a 4-arm 
branched molecule with Watson-Crick complementary sticky ends as binding 
sites) can, theoretically, be used to make computations [54]. This notion has 
not yet been realized experimentally in more than one dimension [30]. A three- 
dimensional framework for computing 2D circuits and constructing DNA-objects 
with given shapes, has been suggested [7], where again, DNA tiles mimic Wang 
tiles. It is important to stress that molecular tiles are not conceived to generate 
only uniform tiling of the plane, but on the contrary, they can be used to induce 
the assembly of objects of arbitrary shapes. 

Combinatorial optimisation problems: fixing a single shape. Two combinatorial 
problems have been stated in [3]. The first concerns minimum tile sets, i.e. given 
a shape, find the tile system with the minimum number of tile types that can 
uniquely self-assemble into this shape. The second concerns tile concentration, 
i.e. given a shape and a tile system that uniquely produces the given shape, 
assign concentrations to each tile-type so that the expected assembly time for 
the shape is minimized. The first combinatorial problem is NP-complete and the 
second is conjectured to be #P [3]. These problems have been formulated for 
any given shape even though only square tiles, i.e. Wang tiles, have been studied 
until now. 

Templates and fixed shapes. Can one find a small set of relatively simple tiles
such that, starting from a template supporting a linear code (that may be a DNA
or RNA molecule incorporated into a macromolecular complex), the assembly
process will create a given three-dimensional shape in the space? We think here of
interacting tiles performing a transformation from labeled templates into
three-dimensional structures and we ask what kind of transformations can be realized
in this way [7]. Also, one wants to understand how much the complexity of the
construction depends on the complexity of the tiles, where the latter can be
measured by the number of the binding sites of the tiles, the size of the sets $B_{ij}$,
etc.

Combinatorial optimisation problems: fixing a “family” of shapes. Fractal as- 
sembly provides an example of an iterative algorithm for self-assembly which 
generates fractals of arbitrary dimension and not just a single shape with a 
given size. For each dimension, the building blocks necessary to build the corre- 
sponding fractal shape need to satisfy the same self-similar properties, and the 
design of a tile set which satisfies these properties is not obvious. For instance, 
given a Sierpinski square fractal and an iterative algorithm that produces arbi- 
trarily large instances of this shape, is there a set of Wang tiles that can uniquely 
assemble into any fractal size? It is not at all clear that a set of Wang tiles with 
self-similar coding exists. In [9] a set of tiles, whose boundaries are characterized 
by both a coding sequence and a geometrical shape, is proposed. Does geometry 
have to be included in the coding of the tile boundaries to impose extra control 
on the assembly? What is the minimum number of tiles necessary to generate a 
family of shapes? 

In general, let an algorithm for self-assembly be fixed. What are the properties 
of the tiles which are necessary to realize the algorithm? 

Dynamic tiling. A molecular feature that has been used in algorithmic self- 
assembly is the possibility to program and change the status of a molecule. 
This means that the molecule passes in time through several possible physical 
conformations, i.e. geometrical shapes. In DNA nanotechnology, this has been 
done by using “template” molecules (programmable tiles) that interact with 
DNA single strands [47,46]: the pairing of the single stranded DNA present in 
the solution to a single strand subsequence of the tile induces this latter to 
change its conformation. Because of these conformational changes, tiles get a 
different status during the assembly, with the effect that one is able to control 
the dynamics of the algorithm and the direction of the assembly. As a result, 
one can generate different architectures out of the same set of tiles by varying 
their conformations. 

Example 3. One can imagine a basic molecular system that is fundamentally a 
layer of programmable tiles which can guide the assembly of multiple layers of tiles 
above it [7]. In the 2-dimensional case this device can compute tree-like boolean 
circuits, and in 3D, it can induce finite regular and less regular shapes. Multiple 
regular layers are obtained by completely filling up the template board: a new 
layer of tiles will cover-up the old one and will play the role of a new board in 
allowing the creation of a third layer, and so on. “Walls” with specified “height”, 
or discrete irregular shapes are obtained by partially filling-up the board, and 
this can be achieved by inserting appropriate coding in the programmable tiles 
that form the template board. The coding determines which tiles will interact 
with new ones and which will avoid interaction. 

In the example, a change in the programming of the board induces the for- 
mation of different shapes out of the same input set. This suggests that a formal 
notion of complexity describing self-assembly of molecular systems cannot be 
based merely on the variety of shapes that potentially can be assembled, but 
rather on the much larger variety of algorithms that allow their assembly. 

DNA computing. Last, we want to mention the effort in designing algorithms for 
DNA-computation. The landmark step is in [1], where DNA is used to solve an 
instance of the Hamiltonian Path problem, which asks whether there is a path 
between two given cities that visits every city exactly once, given an 
incomplete set of available roads. A set of strands of DNA is used to represent 
cities and roads (similar to the description of the 3-coloring problem in 
Section 6), and the coding is such that a strand representing a road connects 
(according to the rules of base-pairing) to the two strands representing the 
cities it joins. By mixing together strands, joining the cities connected by 
roads, and weeding out any wrong answers, it has been shown that the strands 
could self-assemble to solve the problem. 

The first link between DNA-nanotechnology and DNA-computation was 
established in [54] with the suggestion that short branched DNA-molecules could 
be “programmed” to undergo algorithmic self-assembly and thus serve as the 
basis of computation. Other work has followed, for example [30,34,25]. 

11 Discussion 

Most examples in this paper were based on Watson-Crick interactions of DNA 
molecules. Other kinds of interaction, usually referred to as tertiary interactions, 
can be used to obtain controlled behavior in the assembly of DNA molecules, 
for example, DNA triplexes [15], tecto-RNA [21] and G-quartet formation [42]. 
In the combinatorial DNA constructions that we presented, tertiary interactions 
were carefully avoided with the goal of maximizing control on the dynamics of 
the assembly. Tertiary interactions are not as readily controlled as Watson-Crick 
interactions. The next generation of structural DNA nanotechnologists will be 
likely to exploit this wider range of structural possibilities and it appears possible 
that new combinatorics might arise from these options. 

Abstract. We introduce some classes of splicing languages generated 
with simple and semi-simple splicing rules, in both the linear and circular 
cases. We investigate some of their properties. 



1 Introduction 

The operation of splicing was introduced as a generative mechanism in formal 
language theory by Tom Head in [9], and lies at the foundation of the field of 
DNA based computing (see [11] and [14]). 

We are concerned in this paper with some classes of languages which arise 
when considering very simple types of splicing rules. Our main inspiration is the 
study of [12], where the notion of simple H system and the associated notion of 
simple splicing languages have been introduced. 

A simple splicing rule is a rule of the form (a, 1; a, 1) with a ∈ A a symbol 
called marker. More precisely, we will call such a rule a (1,3)-simple splicing 
rule, since the marker a appears on the 1 and 3 positions of the splicing rule, 
and since one can conceive of (i,j)-simple splicing rules for all pairs (i,j) with 
i = 1,2, j = 3,4. We denote by SH(i,j) the class of languages generated by 
simple splicing rules of type (i,j). The paper [12] focuses basically on the study 
of the SH(1,3) class (which is equal to the SH(2,4) class and is denoted by SH). 
Only towards the end of the paper do the authors of [12] show that there are 
three such distinct and incomparable classes, SH = SH(1,3) = SH(2,4), SH(2,3) 
and SH(1,4), of (linear) simple splicing languages. They infer that most of the 
results proven for the SH class will hold also for the other classes, and point 
towards studying one-sided splicing rules of radius at most k: rules (u1, 1; u3, 1) 
with |u1| ≤ k, |u3| ≤ k. Note that the “one-sidedness” they are considering is of 
the (1,3) type, and that other (i,j) types can be considered, provided they are 
distinct from (1,3). 

The notion of semi-simple splicing rules introduced by Goode and Pixton in 
their paper [8] seems to follow the research topic proposed by [12] considering 
the simplest case, with k = 1. More precisely, semi-simple splicing rules are rules 
of the form (a, 1; b, 1) with a, b ∈ A, symbols which again we will call markers. 
Again the one-sidedness is of the (1,3) type, and we denote by SSH(1,3) the 
class of languages generated by such rules. 

Goode and Pixton state explicitly that their interest lies in characterizing 
regular languages which are splicing languages of this particular type (we will 
be back to this problem later), and not in continuing the research along the lines 
of [12]. 

Instead, we are interested here in analyzing different classes, for their 
similarities or dissimilarities of behavior. We show that the classes SSH(1,3) and 
SSH(2,4) are incomparable, and we provide more examples of languages of these 
types. 

The connections between (linear) splicing languages and regular languages 
are well explored in one direction: it has been shown that iterated splicing which 
uses a finite set of axioms and a finite set of rules generates only regular 
languages (see [6] and [15]). On the other hand, the problem of characterizing the 
regular languages which are splicing languages is still open. Very simple regular 
languages, like (aa)*, are not splicing languages. The problem is: starting from 
a regular language L ⊆ V*, can we find a set of rules R, and a (finite) set of 
axioms A ⊆ V*, such that L is the splicing language of the H system (V, A, R)? 
This problem is solved for simple rules in [12] with algebraic tools. The problem 
is also addressed by Tom Head in [10], and solved for a family of classes, a family 
to which we will refer in Section 5. In [8] the problem is solved for 
(1,3)-semi-simple rules. The characterization is given in terms of a certain 
directed graph, the (1,3) arrow graph canonically associated to a regular 
language respected by a set of (1,3)-semi-simple rules. We show in Section 4 that 
their construction can be dualized: we construct the (2,4) arrow graph for a 
language respected by (2,4)-semi-simple splicing rules. We thus obtain an 
extension of the characterization from [8]. Among other things, this enables us 
to prove for languages in SSH(2,4) properties valid for languages in SSH(1,3). 
The construction also makes obvious the fact that the other two semi-simple 
types have to be treated differently. 

In Section 5 we try to find more connections to the study in [10] and point 
out certain open problems in this direction. 

The second part of this paper is concerned with circular splicing. Splicing of 
circular strings has been considered as early as the pioneering paper [9], but it 
does not occupy in the literature the same volume as that occupied by linear 
splicing. With respect to the computational power, it is still unknown precisely 
what class can be obtained by circular splicing, starting from a finite set of rules 
and a finite set of axioms (see [1], [15]). It is known that finite circular splicing 
systems can generate non-regular circular languages (see [17], [15]). On the other 
hand, there are regular circular languages (for instance the language ^(aa)*b) 
which cannot be generated by any finite circular splicing system (with Paun 
type of rules) (see [2], [3]). The fact that the behavior in the circular case is 
by no means predictable from the behavior in the linear case is made apparent 
by the example of the ^(aa)* circular language, which is both a regular and a 
splicing language (see [2], [3]). 

This has motivated the beginning of the study of simple circular H systems 
in the paper [5]. We recall briefly in Section 6 some results of [5], which are 
completed here by some closure properties. 

In Section 7 we introduce the semi-simple circular splicing systems and make 
first steps in the study of such systems by proving the incomparability of the 
classes SSH°(1,3) and SSH°(2,4), a situation which is totally dissimilar to the 
simple circular case, where more classes collapse than in the simple linear case. 

The following table summarizes the notations used for the different types of 
H systems and the respective classes of languages considered throughout the 
paper, and the main references for the types already studied in the literature. 



Table 1. Types of splicing systems and languages 

                  linear case, V*     circular case, V° 
simple            SH [12]             SH° [5] 
semi-simple       SSH [8]             SSH° 
k-simple          SkH [10] 
k-semi-simple     SSkH 





2 Preliminaries and Notations 

For a finite alphabet V we denote by V* the free monoid over V, by 1 the empty 
string, and we let V+ = V* \ {1}. For a string w ∈ V* we denote by |w| its 
length, and if a ∈ V is a letter, we denote by |w|_a the number of occurrences of 
a in w. We let mi : V* → V* denote the mirror image function, and we will 
denote still by mi its restriction to subsets of V*. For a word w ∈ V*, we call 
u ∈ V* a factor of w, if there exist x, y ∈ V* such that w = xuy. We call u ∈ V* 
a factor of a language L, if u is a factor of some word w ∈ L. 

On V* we consider the equivalence relation ~ given by xy ~ yx for x, y ∈ V* 
(the equivalence given by the circular permutations of the letters of a word). 

A circular string over V will be an equivalence class w.r.t. the relation above. 
We denote by ^w the class of the string w ∈ V*. We denote by V° the set of 
all circular strings over V, i.e., V° = V*/~. The empty circular string will be 
denoted ^1, and we let V⊕ = V° \ {^1}. Any subset of V° will be called a circular 
language. 

For circular words, the notions of length and number of occurrences of a 
letter can be immediately defined using representatives: |^w| = |w|, and 
|^w|_a = |w|_a. Also the mirror image function mi : V* → V* can be readily 
extended to mi : V° → V°: if ^w is read in a clockwise manner, mi(^w) = ^mi(w) 
consists of the letters of w read in a counter-clockwise manner. We will denote 
composition of functions algebraically. 

To a language L ⊆ V* we can associate the circular language 
Cir(L) = {^w | w ∈ L}, which will be called the circularization of L. The 
notation L° is also used in the literature. 

To a circular language C ⊆ V° we can associate several linearizations, i.e., 
several languages L ⊆ V* such that Cir(L) = C. The full linearization of the 
circular language C will be Lin(C) = {w | ^w ∈ C}. 

Having a family FL of languages we can associate to it its circular counterpart 
FL° = {Cir(L) | L ∈ FL}. We can thus speak of FIN°, REG°, etc. A circular 
language is in REG° if some linearization of it is in REG. 
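
The notions of circular string, circularization and full linearization can be made concrete with a minimal Python sketch. It assumes, purely for illustration, that a circular string ^w is stored as the lexicographically least rotation of w; the helper names canon, cir, lin and mi_circ are hypothetical and do not come from the paper, and only finite languages are handled.

```python
def canon(w: str) -> str:
    """Representative of the class ^w: the least rotation of w (an arbitrary choice)."""
    return min((w[i:] + w[:i] for i in range(len(w))), default="")


def cir(lang):
    """Circularization Cir(L) = {^w | w in L} of a finite language L."""
    return {canon(w) for w in lang}


def lin(circ):
    """Full linearization Lin(C) = {w | ^w in C} of a finite circular language C."""
    return {c[i:] + c[:i] for c in circ for i in range(max(len(c), 1))}


def mi_circ(c: str) -> str:
    """Mirror image on circular strings: mi(^w) = ^mi(w)."""
    return canon(c[::-1])


if __name__ == "__main__":
    print(cir({"ab", "ba"}))      # both words lie in the same class
    print(sorted(lin({"aab"})))   # all rotations of aab
```

Any other fixed choice of representative would do equally well; the least rotation is used here only because it is easy to compute.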

Next we recall some notions on splicing. A splicing rule is a quadruple 
r = (u1, u2; u3, u4) with ui ∈ V* for i = 1, 2, 3, 4. The action of the splicing 
rule r on linear words is given by 

(x1u1u2x2, y1u3u4y2) ⊢_r x1u1u4y2. 

In other words, the string x = x1u1u2x2 is cut between u1 and u2, the string 
y = y1u3u4y2 between u3 and u4, and two of the resulting substrings, namely 
x1u1 and u4y2, are pasted together producing z = x1u1u4y2. 
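
The action of a splicing rule on a pair of linear words can be sketched in Python as follows. This is only an illustration of the definition, not code from the paper: the function name splice and the encoding of a rule as a 4-tuple of strings (with the empty string standing for 1) are assumptions made here.

```python
def splice(x: str, y: str, rule) -> set:
    """Return every z with (x, y) |-_r z for the rule r = (u1, u2, u3, u4).

    x is cut between an occurrence of u1 and u2, y between u3 and u4,
    and the prefix x1 u1 is pasted to the suffix u4 y2.
    """
    u1, u2, u3, u4 = rule
    results = set()
    for i in range(len(x) + 1):
        if not x.startswith(u1 + u2, i):
            continue
        for j in range(len(y) + 1):
            if y.startswith(u3 + u4, j):
                # x[:i] is x1 and y[j + len(u3):] is u4 y2
                results.add(x[: i + len(u1)] + y[j + len(u3):])
    return results


if __name__ == "__main__":
    print(splice("1ab2", "3cd4", ("a", "b", "c", "d")))
```

The result is a set because the sites u1u2 and u3u4 may occur at several positions, each pair of positions giving one spliced word.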

For a language L ⊆ V* and a splicing rule r, we denote 

r(L) = {z | (x, y) ⊢_r z for some x, y ∈ L}. 

We say that a splicing rule r respects a language L if r(L) ⊆ L. In other 
words, L is closed w.r.t. the rule r. A set of rules R respects the language L if 
R(L) ⊆ L. 

For an alphabet V, a (finite) A ⊆ V* and a set R of splicing rules, a triple 
S = (V, A, R) is called a splicing system, or H system. The pair σ = (V, R) is 
called a splicing (or H) scheme. 

For an arbitrary language L ⊆ V*, we denote σ^0(L) = L, σ(L) = L ∪ {z | 
for some r ∈ R, x, y ∈ L, (x, y) ⊢_r z}, and for any n ≥ 1 we define 
σ^(n+1)(L) = σ^n(L) ∪ σ(σ^n(L)). The language generated by the H scheme σ 
starting from L ⊆ V* is defined as 

σ*(L) = ⋃_{n≥0} σ^n(L). 

The language generated by the H system S = (V, A, R), denoted L(S), is 
L(S) = σ*(A), i.e., what we obtain by iterated splicing starting from the strings 
in the axiom set A. We recall that L(S) can be also characterized as being the 
smallest language L ⊆ V* such that: 

(i) A ⊆ L; 

(ii) r(L) ⊆ L for any r ∈ R. 

In a similar manner we define splicing on circular languages. The only 
difference is that we start from circular strings as axioms, A ⊆ V°, and the 
application of a splicing rule produces a circular string from two circular 
strings (as we will formally define in Section 6). 

Other notations and definitions will be introduced, whenever necessary, 
throughout the paper. 



3 Simple Linear H Systems and Languages 

We recall in this section some of the results of [12] on simple linear H systems. 

A simple (linear) splicing rule is a splicing rule of one of the four following 
forms: (a, 1; a, 1), (1, a; 1, a), (1, a; a, 1), (a, 1; 1, a), where a is a symbol in V, 
called marker. The action of the four types of simple splicing rules on strings is 
illustrated below: 

type (1,3): (xax', yay') ⊢_(a,1;a,1) xay', 
type (2,4): (xax', yay') ⊢_(1,a;1,a) xay', 
type (2,3): (xax', yay') ⊢_(1,a;a,1) xy', 
type (1,4): (xax', yay') ⊢_(a,1;1,a) xaay'. 

We note that the action of rules of type (1,3) and (2,4) coincide. 
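
The four actions listed above can be replayed on a small instance. In the following Python sketch the digits 1, 2, 3, 4 play the role of x, x', y, y'; the helper names splice and simple_rule and the tuple encoding of rules (empty string for 1) are hypothetical choices made for this illustration.

```python
def splice(x, y, rule):
    """All z with (x, y) |-_r z for r = (u1, u2, u3, u4)."""
    u1, u2, u3, u4 = rule
    return {x[: i + len(u1)] + y[j + len(u3):]
            for i in range(len(x) + 1) if x.startswith(u1 + u2, i)
            for j in range(len(y) + 1) if y.startswith(u3 + u4, j)}


def simple_rule(a, kind):
    """Simple splicing rule with marker a; kind is (1,3), (2,4), (2,3) or (1,4)."""
    i, j = kind
    rule = ["", "", "", ""]
    rule[i - 1] = a   # marker on position i of the rule
    rule[j - 1] = a   # marker on position j of the rule
    return tuple(rule)


if __name__ == "__main__":
    x, y = "1a2", "3a4"   # x = xax' and y = yay' with one-letter contexts
    for kind in [(1, 3), (2, 4), (2, 3), (1, 4)]:
        print(kind, splice(x, y, simple_rule("a", kind)))
```

Types (1,3) and (2,4) give the same word, while type (2,3) drops the marker and type (1,4) doubles it.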

A simple H system is a triple S = (V, A, R) consisting of an alphabet V, a 
finite set A ⊆ V* called the set of axioms (or initial strings), and a finite set R 
of simple splicing rules of one of the types (i,j), with i = 1,2, j = 3,4. 

The language generated by a simple H system S as above is defined as usual 
for splicing schemes 

L = σ*(A), 

where σ = (V, R). If the rules in R are of type (i,j), this language will be called 
an (i,j)-simple splicing language. The class of all (i,j)-simple splicing languages 
will be denoted by SH(i,j). We have SH(1,3) = SH(2,4) and let SH denote 
this class. 

Theorem 1. (Theorem 11 of [12]) Any two of the families SH, SH(1,4) and 
SH(2,3) are incomparable. 

We have, from the more general results of [15] and [6], that all three classes 
contain only regular languages, i.e., SH, SH(1,4), SH(2,3) ⊆ REG. In [12] the 
inclusion SH ⊆ REG is also proved directly, using a representation result. 

The inverse problem, of characterizing those regular languages which are 
simple splicing languages is solved via an algebraic characterization. 

4 Semi-simple Linear H Systems and Languages 

Semi-simple splicing rules were first considered by Goode and Pixton in [8]. In 
[12] it is mentioned that “it is also of interest to consider one-sided splicing rules 
of radius at most k: (u1, 1; u2, 1) with |u1| ≤ k, |u2| ≤ k”. The semi-simple 
splicing rules are one-sided splicing rules of radius k = 1. Moreover, the 
“one-sidedness” of the rules considered by Goode and Pixton is of the (1,3) type. 
We present the basic definitions for the four possible types. 

For two symbols a, b ∈ V a semi-simple (linear) splicing rule with markers a 
and b is a splicing rule of one of the following four types: (a, 1; b, 1), (1, a; 1, b), 
(1, a; b, 1), (a, 1; 1, b). 

The action of the four rules on strings is obvious. 

A semi-simple H system is an H system S = (V, A, R) with all rules in R 
semi-simple splicing rules of one of the types (i,j), with i = 1, 2, j = 3, 4. 

The language generated by a semi-simple H system S as above is defined as 
usual for splicing schemes, L = σ*(A), where σ = (V, R). If the rules in R are 
semi-simple rules of type (i,j), this language will be called an (i,j)-semi-simple 
splicing language. The class of all (i,j)-semi-simple splicing languages will be 
denoted by SSH(i,j). 

In [8] only the class SSH(1,3) is considered. Since in [8] only one example 
of a semi-simple language of the (1,3) type is given, we find it useful to provide 
some more. 

Example 1 Consider S3 = ({a, b}, {aba}, {(a, 1; b, 1)}). The language generated 
by S3 is L3 = L(S3) = aa+ ∪ aba+ ∈ SSH(1,3). 

Example 2 Consider S4 = ({a, b}, {aba}, {(1, a; 1, b)}). The language generated 
by S4 is L4 = L(S4) = b+a ∪ ab+a ∈ SSH(2,4). 
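
Examples 1 and 2 can be checked mechanically on short words. The Python sketch below is an illustration only: the names splice and closure, the tuple encoding of rules (empty string for 1) and the length bound 6 are assumptions made here. It performs iterated splicing restricted to words of length at most 6 and compares the result with the words of that length described by the regular expressions aa+ ∪ aba+ and b+a ∪ ab+a.

```python
import re


def splice(x, y, rule):
    """All z with (x, y) |-_r z for r = (u1, u2, u3, u4)."""
    u1, u2, u3, u4 = rule
    return {x[: i + len(u1)] + y[j + len(u3):]
            for i in range(len(x) + 1) if x.startswith(u1 + u2, i)
            for j in range(len(y) + 1) if y.startswith(u3 + u4, j)}


def closure(axioms, rules, max_len):
    """Iterated splicing restricted to words of length <= max_len."""
    lang = {w for w in axioms if len(w) <= max_len}
    while True:
        new = {z for r in rules for x in lang for y in lang
               for z in splice(x, y, r) if len(z) <= max_len} - lang
        if not new:
            return lang
        lang |= new


BOUND = 6
L3 = closure({"aba"}, [("a", "", "b", "")], BOUND)   # Example 1, rule (a, 1; b, 1)
L4 = closure({"aba"}, [("", "a", "", "b")], BOUND)   # Example 2, rule (1, a; 1, b)

if __name__ == "__main__":
    print(all(re.fullmatch(r"aa+|aba+", w) for w in L3))
    print(all(re.fullmatch(r"b+a|ab+a", w) for w in L4))
```

A bounded check of this kind is evidence for the two equalities, not a proof of them.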

The following result shows that, even for the types (1,3) and (2,4), which 
share some similarity of behavior, unlike in the case of simple rules, the respective 
classes of languages are incomparable. 

Theorem 2. The classes SSH(1,3) and SSH(2,4) are incomparable. 

Proof: Note first that simple splicing languages in the class SH are in the 
intersection SSH(1,3) ∩ SSH(2,4). 

The language L1 = a+ ∪ a+ab ∪ aba+ ∪ aba+b belongs to SSH(1,3), but not 
to SSH(2,4). 

For the first assertion, L1 is generated by the (1,3)-semi-simple H system 
S1 = ({a, b}, {abab}, {(a, 1; b, 1)}). We sketch the proof of L1 = L(S1). Denote 
by r the unique splicing rule, r = (a, 1; b, 1). For the inclusion L1 ⊆ L(S1), note 
first that 

(a|bab, abab|) ⊢_r a, 
(a|bab, ab|ab) ⊢_r a^2b. 

Next, if a^n b ∈ L(S1), then a^n, a^(n+1)b ∈ L(S1), for any natural n ≥ 2, since 

(a^n|b, abab|) ⊢_r a^n, 
(a^n|b, ab|ab) ⊢_r a^(n+1)b. 

Thus, by an induction argument, it follows that a+ ⊆ L(S1) and a+ab ⊆ L(S1). 
Similarly, using induction, from 

(aba|b, abab|) ⊢_r aba, 
(aba|b, ab|a^n) ⊢_r aba^(n+1), for any n ≥ 1, 

it follows that aba+ ⊆ L(S1), and from 

(aba|b, ab|ab) ⊢_r aba^2b, 
(aba^n|b, ab|ab) ⊢_r aba^(n+1)b, for any n ≥ 2, 

it follows that aba+b ⊆ L(S1). 

For the other inclusion, L(S1) ⊆ L1, we use the characterization of L(S1) as 
the smallest language which contains the axioms, and is respected by the splicing 
rules. It is a straightforward exercise to prove that L1 contains the axiom of S1, 
and that it is respected by the splicing rule of S1. 

To show that L1 ∉ SSH(2,4), note that each word in L1 has at most two 
occurrences of b. Suppose L1 were in SSH(2,4). Then it would have been 
generated by (and thus closed to) rules of one of the forms: (1, a; 1, b), 
(1, b; 1, a), (1, a; 1, a), (1, b; 1, b). But we have: 

(ab|ab, a|bab) ⊢_(1,a;1,b) abbab ∉ L1, 
(aba|b, |abab) ⊢_(1,b;1,a) abaabab ∉ L1, 
(ab|ab, |abab) ⊢_(1,a;1,a) ababab ∉ L1, 
(aba|b, a|bab) ⊢_(1,b;1,b) ababab ∉ L1, 

where we have marked the places where the cutting occurs, and all the words 
obtained are not in L1 because they contain three occurrences of b. 

The language L2 = b+ ∪ abb+ ∪ b+ab ∪ ab+ab belongs to SSH(2,4), but not 
to SSH(1,3). 

For the first assertion, L2 is generated by the (2,4)-semi-simple H system 
S2 = ({a, b}, {abab}, {(1, a; 1, b)}). The proof is along the same lines as the proof 
of L1 ∈ SSH(1,3). 

For the second assertion, note that each word in L2 has at most two 
occurrences of a. If L2 were in SSH(1,3) it would be closed to rules of one of the 
forms (a, 1; b, 1), (b, 1; a, 1), (a, 1; a, 1), (b, 1; b, 1). But, even starting from the 
axiom, we can choose the cutting places in such a way as to obtain words with 
three occurrences of a, and thus not in L2: 

(aba|b, ab|ab) ⊢_(a,1;b,1) abaab ∉ L2, 
(abab|, a|bab) ⊢_(b,1;a,1) abab^2ab ∉ L2, 
(aba|b, a|bab) ⊢_(a,1;a,1) ababab ∉ L2, 
(abab|, ab|ab) ⊢_(b,1;b,1) ababab ∉ L2. □ 
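
The eight splicings used as counterexamples in the proof can be replayed mechanically. The following Python sketch is an illustration under assumptions made here (the names splice and check, the tuple encoding of rules with the empty string for 1, and the regular expressions transcribing L1 and L2): for each rule form it confirms that the stated word is obtained by splicing the axiom abab with itself and that it falls outside the language.

```python
import re


def splice(x, y, rule):
    """All z with (x, y) |-_r z for r = (u1, u2, u3, u4)."""
    u1, u2, u3, u4 = rule
    return {x[: i + len(u1)] + y[j + len(u3):]
            for i in range(len(x) + 1) if x.startswith(u1 + u2, i)
            for j in range(len(y) + 1) if y.startswith(u3 + u4, j)}


AXIOM = "abab"
L1 = re.compile(r"a+|a+ab|aba+|aba+b")
L2 = re.compile(r"b+|abb+|b+ab|ab+ab")

# (rule, word obtained in the proof) for the (2,4) rule forms, against L1
WITNESSES_L1 = [(("", "a", "", "b"), "abbab"), (("", "b", "", "a"), "abaabab"),
                (("", "a", "", "a"), "ababab"), (("", "b", "", "b"), "ababab")]
# the same for the (1,3) rule forms, against L2
WITNESSES_L2 = [(("a", "", "b", ""), "abaab"), (("b", "", "a", ""), "ababbab"),
                (("a", "", "a", ""), "ababab"), (("b", "", "b", ""), "ababab")]


def check(witnesses, lang):
    """True if every witness is spliced from the axiom and lies outside lang."""
    return all(z in splice(AXIOM, AXIOM, r) and not lang.fullmatch(z)
               for r, z in witnesses)


if __name__ == "__main__":
    print(check(WITNESSES_L1, L1), check(WITNESSES_L2, L2))
```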

The languages in the proof of Theorem 2 also provide examples of semi-simple 
splicing languages which are not simple. 
For the (1,3) type, we have L1 ∈ SSH(1,3) \ SH(1,3). L1 is not respected 
by the simple rules (a, 1; a, 1) and (b, 1; b, 1); for instance, 

(aba^m b|, ab|a^n) ⊢_(b,1;b,1) aba^m ba^n ∉ L1, 
(aba^p|a^q b, a|ba^n b) ⊢_(a,1;a,1) aba^p ba^n b ∉ L1. 

In a similar manner one can show that L2 ∈ SSH(2,4) \ SH(2,4). 

Splicing rules can be seen as functions from V* × V* to 2^(V*): if 
r = (u1, u2; u3, u4) and x, y ∈ V*, then r(x, y) = {z ∈ V* | (x, y) ⊢_r z}. Of 
course, if there is no z such that (x, y) ⊢_r z (that is, x and y cannot be spliced 
by using rule r), then r(x, y) = ∅. We denote by inv the function 
inv : A × B → B × A, defined by inv(x, y) = (y, x), for x ∈ A, y ∈ B, and 
arbitrary sets A and B. 

We have a strong relationship between semi-simple splicing rules of types 
(1,3) and (2,4). 

Proposition 1. Let a, b ∈ V be two symbols. The following equality of functions 
from V* × V* to 2^(V*) holds: 

(a, 1; b, 1)mi = inv(mi × mi)(1, b; 1, a). 

Proof. Let x, y ∈ V*. If |x|_a = 0 or |y|_b = 0, or both, then, clearly, both 
(a, 1; b, 1)mi and inv(mi × mi)(1, b; 1, a) return ∅. For any x1ax2 ∈ V*aV* and 
y1by2 ∈ V*bV*, we have, on one hand: 

(x1a|x2, y1b|y2) ⊢_(a,1;b,1) x1ay2 →_mi mi(y2) a mi(x1). 

On the other hand, inv(mi × mi)(x1ax2, y1by2) = (mi(y2) b mi(y1), mi(x2) a 
mi(x1)), and 

(mi(y2)|b mi(y1), mi(x2)|a mi(x1)) ⊢_(1,b;1,a) mi(y2) a mi(x1). □ 
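
Proposition 1 lends itself to a brute-force sanity check. The Python sketch below is an illustration only (the names splice, mi, lhs and rhs, the tuple encoding of rules with the empty string for 1, and the choice of all words of length at most 4 over {a, b} are assumptions made here); it compares the two sides of the equality, reading the composition from left to right as in the text.

```python
from itertools import product


def splice(x, y, rule):
    """All z with (x, y) |-_r z for r = (u1, u2, u3, u4)."""
    u1, u2, u3, u4 = rule
    return {x[: i + len(u1)] + y[j + len(u3):]
            for i in range(len(x) + 1) if x.startswith(u1 + u2, i)
            for j in range(len(y) + 1) if y.startswith(u3 + u4, j)}


def mi(w):
    """Mirror image of a word."""
    return w[::-1]


def lhs(x, y, a="a", b="b"):
    """(a, 1; b, 1) followed by mi."""
    return {mi(z) for z in splice(x, y, (a, "", b, ""))}


def rhs(x, y, a="a", b="b"):
    """inv, then mi x mi, then (1, b; 1, a)."""
    return splice(mi(y), mi(x), ("", b, "", a))


WORDS = ["".join(p) for n in range(5) for p in product("ab", repeat=n)]

if __name__ == "__main__":
    print(all(lhs(x, y) == rhs(x, y) for x in WORDS for y in WORDS))
```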

Corollary 1. There exists a bijection between the classes of languages 
SSH(1,3) and SSH(2,4). 

Proof. We construct φ : SSH(1,3) → SSH(2,4) and ψ : SSH(2,4) → 
SSH(1,3). First, making an abuse of notation, we define φ and ψ on types 
of semi-simple splicing rules, by: 

φ(a, 1; b, 1) = (1, b; 1, a), 

ψ(1, b; 1, a) = (a, 1; b, 1), for all a, b ∈ V. 

φ transforms a (1,3) rule into a (2,4) one, and ψ makes the reverse 
transformation. Obviously, ψ(φ(r)) = r, and φ(ψ(r')) = r', for any (1,3) rule r, 
and any (2,4) rule r'. For a set of (1,3) rules, R, let φ(R) denote the 
corresponding set of (2,4) rules, and for R', a set of (2,4) rules, let ψ(R') denote 
the corresponding set of (1,3) rules. 
Let S = (V, A, R) be a (1,3) semi-simple splicing system. Again, making 
an abuse of notation, we let φ(S) denote the (2,4) semi-simple splicing 
system φ(S) = (V, mi(A), φ(R)). For S' = (V, A', R') a (2,4) semi-simple 
splicing system, let ψ(S') denote the (1,3) semi-simple splicing system ψ(S') = 
(V, mi(A'), ψ(R')). 

Now we define φ and ψ on the respective classes of languages by: φ(L(S)) = 
L(φ(S)), for L(S) ∈ SSH(1,3), and ψ(L(S')) = L(ψ(S')), for L(S') ∈ 
SSH(2,4). Note that L(φ(S)) = mi(L(S)), and also L(ψ(S')) = mi(L(S')), 
thus φ and ψ are given by the mirror image function, and are obviously inverse 
to each other. □ 
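
The identity L(φ(S)) = mi(L(S)) can be observed on a small instance. The Python sketch below is a bounded illustration, not a proof: it takes the system S = ({a, b}, {abab}, {(a, 1; b, 1)}) from Theorem 2, builds φ(S), and compares the two languages on words of length at most 7. The names splice, closure, mi and phi, the tuple encoding of rules (empty string for 1) and the bound are assumptions made here.

```python
def splice(x, y, rule):
    """All z with (x, y) |-_r z for r = (u1, u2, u3, u4)."""
    u1, u2, u3, u4 = rule
    return {x[: i + len(u1)] + y[j + len(u3):]
            for i in range(len(x) + 1) if x.startswith(u1 + u2, i)
            for j in range(len(y) + 1) if y.startswith(u3 + u4, j)}


def closure(axioms, rules, max_len):
    """Iterated splicing restricted to words of length <= max_len."""
    lang = {w for w in axioms if len(w) <= max_len}
    while True:
        new = {z for r in rules for x in lang for y in lang
               for z in splice(x, y, r) if len(z) <= max_len} - lang
        if not new:
            return lang
        lang |= new


def mi(w):
    """Mirror image of a word."""
    return w[::-1]


def phi(rule):
    """Map the (1,3) rule (a, 1; b, 1) to the (2,4) rule (1, b; 1, a)."""
    a, _, b, _ = rule
    return ("", b, "", a)


AXIOMS, RULES, BOUND = {"abab"}, [("a", "", "b", "")], 7
mirrored_system = closure({mi(w) for w in AXIOMS}, [phi(r) for r in RULES], BOUND)
mirrored_language = {mi(w) for w in closure(AXIOMS, RULES, BOUND)}

if __name__ == "__main__":
    print(mirrored_system == mirrored_language)
```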

The above results emphasize the symmetry between the (1, 3) and the (2,4) 
cases, symmetry which makes possible the construction which follows next. 

Let V be an alphabet, L ⊆ V* a language, and R(L) the set of splicing rules 
(of a certain predefined type) which respect L. For a subset of rules R ⊆ R(L) 
we denote by σ = (V, R) a splicing scheme. We are looking for necessary and 
sufficient conditions for the existence of a finite set A ⊆ V* such that L = σ*(A). 
For R(L) the set of (1,3)-semi-simple splicing rules which respect L, Goode and 
Pixton have given such a characterization, in terms of a certain directed graph, 
the (1,3)-arrow graph canonically associated to a pair (σ, L). Their construction 
can be modified to accommodate the (2,4)-semi-simple splicing rules. 

We present the main ingredients of this second construction, which we will 
call the (2,4)-arrow graph associated to a pair (σ, L), where σ = (V, R) is a 
(2,4)-semi-simple H scheme (all rules in R are of the (2,4)-semi-simple type). 
Enrich the alphabet with two new symbols, V̄ = V ∪ {S, T}, enrich the set of 
rules R̄ = R ∪ {(1, S; 1, S)} ∪ {(1, T; 1, T)}, take σ̄ = (V̄, R̄), and L̄ = SLT. The 
(2,4)-arrow graph of (σ, L) will be a directed graph G, with the set of vertices 
V(G) = V̄, and a set of (2,4)-edges E(G) ⊆ V̄ × V* × V̄ defined in the following 
way: a triple e = (b', w, b) is an edge (from b' to b) if there exists a (2,4) rule 
r = (1, a; 1, b) ∈ R̄ such that b'wa is a factor of L̄. 

In order to stress the “duality” of the two constructions, let us recall from 
[8] the definition of a (1,3)-edge: a triple e = (a, w, a') is an edge (from a to a') 
if there exists a (1,3) rule r = (a, 1; b, 1) ∈ R̄ such that bwa' is a factor of L̄. 

We list below some results and notions, which are analogous to the results 
obtained for the (1,3) case in [8]. 

Lemma 1. We have σ̄(L̄) ⊆ L̄. If A ⊆ V* then σ̄*(SAT) = Sσ*(A)T. 

Proof. For x = Sw1T and y = Sw2T two words in L̄ we have 

(x, y) ⊢_(1,S;1,S) y ∈ L̄, 
(x, y) ⊢_(1,T;1,T) x ∈ L̄, 
(x, y) ⊢_r S r(w1, w2) T ∈ L̄, for any other r ∈ R̄. 



Lemma 2. (S, w, T) is a (2,4) edge iff SwT ∈ L̄. 
Proof. By definition, if (S, w, T) is a (2,4) edge, then there exists a (2,4) rule 
r = (1, a; 1, T) ∈ R̄ such that Swa is a factor of L̄. The only rule in R̄ involving 
T is (1, T; 1, T), thus a = T, and SwT is a factor of L̄. But L̄ ⊆ SV*T, thus 
SwT ∈ L̄. Conversely, if SwT ∈ L̄, then SwT is a factor of L̄, and since rule 
(1, T; 1, T) ∈ R̄, we have that (S, w, T) is a (2,4) edge. □ 

The product of two adjacent edges in G, e1 = (b0, w1, b1) and e2 = (b1, w2, b2), 
is defined as the triple e1e2 = (b0, w1b1w2, b2). 

Lemma 3. (The closure property) Whenever e1 and e2 are adjacent edges in 
G, e1e2 is an edge of G. 

Proof. If e1 = (b0, w1, b1) and e2 = (b1, w2, b2) are adjacent (2,4) edges in G, 
then there exist the (2,4) rules r1 = (1, a1; 1, b1) and r2 = (1, a2; 1, b2) in R̄, and 
there exist the strings x0, y1, x1, y2 ∈ V̄*, such that the words x0b0w1a1y1 and 
x1b1w2a2y2 are in L̄. Using r1 to splice these words we obtain 

(x0b0w1|a1y1, x1|b1w2a2y2) ⊢_r1 x0b0w1b1w2a2y2 ∈ L̄, 

so b0w1b1w2a2 is a factor of L̄. This fact, together with having rule r2 in R̄, 
makes e1e2 = (b0, w1b1w2, b2) an edge of G. □ 

A path in G is a sequence π = <e1, ..., en> of edges e_k = (b_(k-1), w_k, b_k), 
1 ≤ k ≤ n, every two consecutive ones being adjacent. The label of a path as 
above is λ(π) = b0w1b1 ··· w_n b_n. A single edge e is also a path, <e>, thus its 
label λ(e) is also defined. 

Lemma 4. For π = <e1, ..., en> a path in G, the product e = e1 ··· en exists 
and is an edge of G, whose label equals the label of the path, i.e., λ(e) = λ(π). 

Proof. Using Lemma 3 and the definitions, one can prove by straightforward 
calculations that the product of adjacent edges is associative, hence the product 
of n edges of a path can be unambiguously defined, and is an edge. □ 
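
The bookkeeping behind Lemma 4 (edge products and path labels) can be sketched in Python. The sketch below treats edges as plain triples (b', w, b) and only illustrates that the product is associative and preserves labels; it does not check that a triple really is an edge of G, which requires Lemma 3. The names edge_product, label and path_product, and the sample path, are hypothetical.

```python
from functools import reduce


def edge_product(e1, e2):
    """Product (b0, w1, b1)(b1, w2, b2) = (b0, w1 b1 w2, b2) of adjacent edges."""
    b0, w1, b1 = e1
    c1, w2, b2 = e2
    if b1 != c1:
        raise ValueError("edges are not adjacent")
    return (b0, w1 + b1 + w2, b2)


def label(path):
    """Label b0 w1 b1 ... wn bn of a path given as a list of edges."""
    out = path[0][0]
    for _, w, b in path:
        out += w + b
    return out


def path_product(path):
    """Product e1 ... en of the edges of a path, taken from left to right."""
    return reduce(edge_product, path)


if __name__ == "__main__":
    path = [("S", "a", "b"), ("b", "ab", "b"), ("b", "", "T")]
    e = path_product(path)
    print(e, label([e]) == label(path))
```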

The language of the (2,4) graph G, denoted L(G), is the set of all labels of 
paths in G, from S to T. 

Lemma 5. L(G) = L̄. 

Proof. From Lemma 2 we have L̄ ⊆ L(G), since SwT ∈ L̄ implies (S, w, T) is 
an edge, and λ(S, w, T) = SwT ∈ L(G) by definition. For the other inclusion, 
if π is a path in G from S to T, then, according to Lemma 4 there exists the 
product edge e = (S, w, T) of all edges in π and λ(π) = λ(e) = SwT ∈ L̄. □ 

The following is the analogue of the prefix edge property of (1,3) arrow 
graphs from [8]. 

Lemma 6. (Suffix edge) If (b'', ub'v, b) is an edge of G, with b' ∈ V, then 
(b', v, b) is an edge of G. 
Proof. From (b'', ub'v, b) being an edge, there exists a rule r = (1, a; 1, b) ∈ R̄ 
such that b''ub'va is a factor of L̄. But this makes b'va a factor of L̄, which, 
together with the existence of rule r, establishes that (b', v, b) is an edge. □ 

By a prime edge we mean an edge which is not decomposable into a product 
of edges. Similar to the (1,3) case, a (unique) decomposition into prime edges is 
possible. 

Lemma 7. For every edge e there exists a sequence of prime adjacent edges 
e1, ..., en, such that e = e1 ··· en. 

The prime (2,4)-arrow graph is the subgraph G0 of G with all edges prime. 

Lemma 8. L(G0) = L̄. 

Proof. From Lemma 7, L(G0) = L(G), and from Lemma 5, L(G) = L̄. □ 

Lemma 9. If there exists a set A ⊆ V* such that L = σ*(A), then, if (b0, w, b1) 
is a prime edge in G, we have that w is a factor of A. 

Proof. For any edge e = (b0, w, b1) there is a rule r = (1, a1; 1, b1) ∈ R̄ such that 
b0wa1 is a factor of L̄ = σ̄*(Ā). We can thus define 

N(e) = min{n | there exists r = (1, a1; 1, b1) ∈ R̄ such that 
b0wa1 is a factor of σ̄^n(Ā)}. 

For e prime with N(e) = 0, b0wa1 is a factor of Ā, thus w is a factor of A, so 
the assertion is true. Let e = (b0, w, b1) be a prime edge with N(e) = n > 0, 
and suppose the assertion is true for all prime edges e' with N(e') < n. Then, 
there exist a rule r1 = (1, a1; 1, b1) ∈ R̄, and a string z0 = x0b0wa1y1 in σ̄^n(Ā). 
Because n > 0, there exist a rule r = (1, a; 1, b) ∈ R̄, and strings z = xay', 
z' = x'by ∈ σ̄^(n-1)(Ā), such that (z, z') ⊢_r z0. We analyze now the equality of 
the two decompositions of z0, 

z0 = xby = x0b0wa1y1, 

by comparing the suffixes: 

Case 1: |by| ≥ |b0wa1y1|. Then b0wa1 is a factor of by, which makes it a factor 
of z' = x'by ∈ σ̄^(n-1)(Ā), which contradicts N(e) = n. 

Case 2: |by| < |a1y1|. Then b0wa1 is a factor of x, which makes it a factor of 
z = xay' ∈ σ̄^(n-1)(Ā), which again contradicts N(e) = n. 

Case 3: |a1y1| < |by| < |b0wa1y1|. Then (we have an occurrence of b inside w) 
w = ubv, from which follows the existence of the suffix edge of e, e2 = (b, v, b1) ∈ 
E(G). Formally, e = (b0, u, b)e2, with e and e2 edges. If e1 = (b0, u, b) is also an 
edge, this would contradict the primality of e. But, from xby = x0b0ubva1y1, in 
this case, it follows that x = x0b0u, thus z = x0b0uay'. This makes b0ua a factor 
of z, which, together with the existence of rule r ∈ R̄, makes e1 an edge, and 
thus leads to contradiction. The only remaining case is: 

Case 4: |by| = |a1y1|. Then x = x0b0w. Thus z = xay' = x0b0way', making 
b0wa a factor of z, which, together with the existence of rule r in R̄, makes 
e' = (b0, w, b) an edge. Moreover, e' is a prime edge. (If it were not, there would 
exist edges e1, e2 such that e' = e1e2. If e1 = (b0, u, b') and e2 = (b', v, b), then 
w = ub'v. It would follow that e2' = (b', v, b1) is a suffix edge of e, and thus 
e would admit the decomposition e = e1e2', contradicting its primality.) Since 
N(e') ≤ n − 1, the assertion is true for e', so w is a factor of A. □ 

The following characterization of (2, 4)-semi-simple splicing languages in 
terms of (2,4)-arrow graphs holds. 

Theorem 3. (Analogue of Theorem 30.1 of [8]) For a language L ⊆ V*, R ⊆ 
R(L), where R(L) is the set of (2,4)-semi-simple rules which respect L, take 
σ = (V, R) and consider G0 the prime (2,4)-arrow graph for (σ, L). Then there 
exists a finite A ⊆ V* such that L = σ*(A) if and only if G0 is finite. 

Proof. From Lemma 9, if A is finite, then so is G0. 

Conversely, suppose G0 is finite. For each prime edge e = (b1, w, b0) and 
each rule (1, a0; 1, b0) in R̄ select one word in L̄ which has b1wa0 as a factor. 
Let Ā = SAT be the set of these strings. We prove next that L̄ = σ̄*(Ā), thus 
L = σ*(A). 




Fig. 1. The (1,3) prime arrow graph of L1 = a^+ ∪ a^+ab ∪ aba^+ ∪ aba^+b

Take any edge in G from S to T, and consider e = e_n e_{n-1} ... e_1 its prime
factorization, with e_k = (b_k, w_k, b_{k-1}), 1 ≤ k ≤ n. For every k, select a rule
r_k = (1, a_{k-1}; 1, b_{k-1}) and a word z_k in A which has b_k w_k a_{k-1} as a factor. For
k = 1, b_0 = T, thus a_0 = T, and b_1 w_1 T is a factor (actually a suffix) of z_1. For
k = n, b_n = S, thus S w_n a_{n-1} is a factor (actually a prefix) of z_n. We splice z_2
and z_1 using rule r_1, next z_3 and the previously obtained string using r_2, and
so on, until splicing of z_n and the previously obtained string, using r_{n-1}, finally
gives us λ(e). □

Corollary 2. A language L ⊆ V* belongs to the class SSH(2,4) if and only if
the prime (2,4) arrow graph constructed from (σ, L), with σ = (V, R(L)), is
finite.

Fig. 2. The (2,4) prime arrow graph of L2 = b^+ ∪ abb^+ ∪ ab^+ab ∪ b^+ab

The languages from Theorem 2 have their prime arrow graphs depicted in 
Figures 1 and 2. 

The languages from Examples 1 and 2 have their prime arrow graphs depicted 
in Figures 3 and 4. 




Fig. 3. The (1,3) prime arrow graph of L3 = aa^+ ∪ aba^+

For simple splicing languages in the class SH = SH(1,3) = SH(2,4), the
(1,3) arrow graph and the (2,4) arrow graph coincide, because:

- (a, w, b) is a (1,3)-edge iff there exists a (1,3) rule (a, 1; a, 1) such that awb
is a factor of L;

- (a, w, b) is a (2,4)-edge iff there exists a (2,4) rule (1, b; 1, b) such that awb
is a factor of L.



Example 3 ([12]) Let S1 = ({a, b, c}, {abaca, acaba}, {(b, 1; b, 1), (c, 1; c, 1)}) be
a (1,3) simple splicing system. The generated language is:

L = L(S1) = (abac)^+a ∪ (abac)*aba ∪ (acab)^+a ∪ (acab)*aca.

The same language can be generated by the (2,4) simple splicing system S2 =
({a, b, c}, {abaca, acaba}, {(1, b; 1, b), (1, c; 1, c)}), i.e., we have L = L(S2).
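The claim of Example 3 can be checked mechanically on short words. The following is a minimal Python sketch, not taken from the paper. It assumes the empty string 1 is encoded as "", a rule (u1, u2; u3, u4) as a 4-tuple, and linear splicing as (x u1 u2 y, x' u3 u4 y') ⊢ x u1 u4 y'. The names splice and bounded_language are illustrative. Only words of length at most max_len are kept; in general such a bound could miss words whose derivation needs longer intermediates, but for this example it recovers the eight words of L of length at most 9.

```python
import re

def splice(w1, w2, rule):
    """All results of (x u1 u2 y, x' u3 u4 y') |- x u1 u4 y' for one rule."""
    u1, u2, u3, u4 = rule
    left, right = u1 + u2, u3 + u4
    results = set()
    for i in range(len(w1) - len(left) + 1):
        if not w1.startswith(left, i):
            continue
        for j in range(len(w2) - len(right) + 1):
            if w2.startswith(right, j):
                # keep "x u1" from the first word and "u4 y'" from the second
                results.add(w1[:i + len(u1)] + w2[j + len(u3):])
    return results

def bounded_language(axioms, rules, max_len):
    """Closure of the axioms under the rules, restricted to short words."""
    lang = set(axioms)
    changed = True
    while changed:
        changed = False
        for w1 in list(lang):
            for w2 in list(lang):
                for rule in rules:
                    for z in splice(w1, w2, rule):
                        if len(z) <= max_len and z not in lang:
                            lang.add(z)
                            changed = True
    return lang

axioms = {"abaca", "acaba"}
rules_13 = [("b", "", "b", ""), ("c", "", "c", "")]   # (b,1;b,1), (c,1;c,1)
rules_24 = [("", "b", "", "b"), ("", "c", "", "c")]   # (1,b;1,b), (1,c;1,c)

L13 = bounded_language(axioms, rules_13, 9)
L24 = bounded_language(axioms, rules_24, 9)
pattern = re.compile(r"(abac)+a|(abac)*aba|(acab)+a|(acab)*aca")

print(sorted(L13, key=lambda w: (len(w), w)))
print(L13 == L24, all(pattern.fullmatch(w) for w in L13))
```

For a single marker a, the rules (a, 1; a, 1) and (1, a; 1, a) both send (xay, x'ay') to xay', which is why the two bounded languages agree.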
Fig. 4. The (2,4) prime arrow graph of L4 = b^+a ∪ ab^+a




Fig. 5. The (1,3) and also the (2,4) prime arrow graph of L = (abac)^+a ∪
(abac)*aba ∪ (acab)^+a ∪ (acab)*aca

Note that the axiom set of S1 is closed under mirror image, and equals the axiom
set of S2. The (1,3) arrow graph of L coincides with its (2,4) arrow graph, and
is depicted in Figure 5.

Due to the construction of the (2,4)-arrow graph, other results of [8] can
be extended from languages in SSH(1,3) to languages in the class SSH(2,4).
We give the following, without proofs, since the proofs in the (1,3) case are 
all based on the prime (1,3) arrow graph construction, and can be replaced by 
similar proofs, based on the prime (2,4) arrow graph. 

First, some relations with the notion of constant introduced by
Schützenberger [16]. A constant of a language L ⊆ V* is a string c ∈ V* such
that for all x, y, x', y' ∈ V*, if xcy and x'cy' are in L, then xcy' is in L. The rela-
tion with the splicing operation is obvious, since xcy' = (c, 1; c, 1)(xcy, x'cy') =
(1, c; 1, c)(xcy, x'cy'). Thus c is a constant of L iff r(L) ⊆ L for r = (c, 1; c, 1), or
iff r'(L) ⊆ L for r' = (1, c; 1, c).

Theorem 4. A language L ⊆ V* is a simple splicing language (in the class
SH = SH(1,3) = SH(2,4)) iff there exists an integer K such that every factor
of L of length at least K contains a symbol which is a constant of L.

There are many constants in the (1,3) and the (2,4) semi-simple cases. 

Theorem 5. If L ∈ SSH(1,3) ∪ SSH(2,4) then there exists a positive integer
K such that every string in V* of length at least K is a constant of L.

For L ∈ SSH(1,3) the result is proved in [8]. For L ∈ SSH(2,4) the same
proof holds, replacing prefix edge with suffix edge.

According to [7], a language L is strictly locally testable if and only if there 
exists a positive integer K such that every string in V* of length at least K is a 
constant of L. We have thus: 

Corollary 3. (extension of Corollary 30.2 of [8]) If L is a (1,3) or a (2,4)- 
semi-simple splicing language, then L is strictly locally testable. 

5 A Hierarchy of Classes of Semi-simple Languages 

To illustrate the number and variety of open problems in this area, let us note 
that there is no connection formulated yet between the semi-simple languages 
as introduced by Goode and Pixton in [8], and the classes introduced by Tom
Head in [10].

Recall that, inspired by the study of simple splicing languages introduced in
[12], Head introduced the families of splicing languages {S_kH | k ≥ −1} using the
null context splicing rules defined in [9]. A splicing rule, according to the original
definition of [9], is a sixtuple (u1, x, u2; u3, x, u4) (x is called the cross-over of the
rule) which acts on strings precisely as the (Păun type) rule (u1x, u2; u3x, u4).
(For a detailed discussion of the three types of splicing, namely the definitions
given by Head, Păun, and Pixton, and a comparison of their generative power, we
refer the reader to [1], [2], [3], [4].)

A null context splicing rule is a rule of the particular form (1, x, 1; 1, x, 1)
(all contexts u_i are the empty string 1) and thus will be precisely a rule of type
(x, 1; x, 1) with x ∈ V*. Such a rule is identified by Tom Head with the string
x ∈ V*. For a fixed k ≥ −1, a splicing system (V, A, R) with rules (x, 1; x, 1) ∈ R
such that |x| ≤ k is called an S_kH system, and languages generated by an S_kH
system are the S_kH languages (more appropriately, S_kH(1,3) languages).

Note that the class S_1H(1,3) is precisely the class SH of simple H systems
of [12].

For k = −1, the S_{−1}H(1,3) systems are precisely those for which the set of
rules R is empty. Thus the languages in the class S_{−1}H(1,3) are precisely the
finite languages, i.e., S_{−1}H(1,3) = FIN.

For k = 0, the S_0H(1,3) systems have either R empty, or R = {(1, 1; 1, 1)}.
Thus infinite languages in the class S_0H(1,3) are generated by H systems of
the form S = (V, A, {(1, 1; 1, 1)}), with A ≠ ∅, and it can be easily shown that
L(S) = alph(A)*, where alph(A) denotes the set of symbols appearing in strings
of A.

The union of all classes SkH is precisely the set of null context splicing lan- 
guages. It is shown in [10] that this union coincides with the family of strictly 
locally testable languages (in the sense of [7]). Also from [10] we have the fol- 
lowing result: 

Lemma 10. The sequence {S_kH(1,3) | k ≥ −1} is a strictly ascending infinite
hierarchy of language families for alphabets of two or more symbols.

Proof. The inclusions are obvious, their strictness is established via an example:

Example 4 [10] Let V = {a,b}. For each k > —1 let Lk = {b^ a^)* . 

Note that = L{V, Ak, Rk) where Ak = and Rk = 

(a^k, 1; a^k, 1). Thus L_k ∈ S_kH(1,3) and is not in S_jH(1,3) for any j < k. □

A “semi-simple” extension of the above concepts would be to consider rules
of type (u1, 1; u3, 1) with “radius” k, i.e., |u1|, |u3| ≤ k (as suggested also in [12]).
Let us call such a rule a k-fat semi-simple rule of type (1,3). For a fixed k ≥ −1,
a splicing system (V, A, R) with rules r ∈ R of the above form is called a k-fat
semi-simple (1,3) H system, or briefly an SS_kH(1,3) system, and languages
generated by such systems are the SS_kH(1,3) languages.

Note that the SS_1H(1,3) class is precisely the class SSH(1,3) considered in
[8].

For k = −1, similarly to the simple case, the SS_{−1}H(1,3) systems are
precisely those for which the set of rules R is empty, and thus the languages
in the class SS_{−1}H(1,3) are precisely the finite languages, SS_{−1}H(1,3) =
S_{−1}H(1,3) = FIN.

For k = 0, again reasoning as in the simple case, we have that SS_0H(1,3)
systems have either R empty, or R = {(1, 1; 1, 1)}. Thus we have the equality
of classes, SS_0H(1,3) = S_0H(1,3), and thus SS_0H(1,3) contains only finite
languages and languages of the form W* for some subalphabet W ⊆ V.

We have the following analogue of Lemma 10: 

Lemma 11. The sequence {SS_kH(1,3) | k ≥ −1} is a strictly ascending infinite
hierarchy of language families for alphabets of three or more symbols.

Proof. The inclusions are obvious, their strictness follows from the example be-
low.

Example 5 Let V = {a, b, c}. For each k ≥ −1 let

L_k = a^k(b^k a^(k-1) b^k a^k)^+ ∪ c^k(b^k a^(k-1) b^k a^k)^+.

Note that L_k = L(V, A_k, R_k) where A_k = {a^k b^k a^(k-1) b^k a^k, c^k b^k a^(k-1) b^k a^k} and
R_k = (a^k, 1; c^k, 1). Thus L_k ∈ SS_kH(1,3) and is not in SS_jH(1,3) for any
j < k. □

It is unknown whether the result above holds for an alphabet of two symbols. 
It remains to be investigated whether other results obtained in [10] for SkH 
languages can be extended to SSkH languages. The relationship with the hier- 
archy of regular languages in [13] also remains to be studied. 

Considering other types (i,j), different from (1,3), is also worth exploring 
further. 

For two words u, v ∈ V*, such that |u|, |v| ≤ k, a semi-simple k-fat (linear)
splicing rule, with markers u and v, is a splicing rule of one of the following four
types: (u, 1; v, 1), (1, u; 1, v), (1, u; v, 1), (u, 1; 1, v). The corresponding SS_kH(i, j)
systems and languages can be defined as usual. For instance, a k-fat semi-simple
(2,4) H system, or briefly an SS_kH(2,4) system, is an H system with rules of
the form (1, x; 1, y), with x, y ∈ V*, and |x|, |y| ≤ k. For every k ≥ −1 we will
have the corresponding class of languages, SS_kH(2,4). One can readily prove
analogues of Proposition 1 and Corollary 1 for arbitrary k ≥ −1.

Proposition 2. Let u, v ∈ V* be two strings. The following equality of functions
from V* × V* to 2^(V*) holds:

    (u, 1; v, 1)mi = inv(mi × mi)(1, mi(v); 1, mi(u)).

Corollary 4. For every k ≥ −1, there exists a bijection between the classes of
languages SS_kH(1,3) and SS_kH(2,4).

The bijection between H systems will be given by the standard passing from
a k-fat (1,3) rule (u, 1; v, 1) to the k-fat (2,4) rule (1, mi(v); 1, mi(u)), and by
mirroring the axiom set. On languages it will again be the mirror image.
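The identity of Proposition 2 can be tested exhaustively on small words. The sketch below is illustrative only and rests on stated assumptions: the empty string 1 is encoded as "", mi is string reversal, inv is read as swapping the two arguments, and compositions are applied left to right. The helper names splice and mirror_identity_holds are hypothetical.

```python
from itertools import product

def splice(w1, w2, rule):
    """All results of (x u1 u2 y, x' u3 u4 y') |- x u1 u4 y' for one rule."""
    u1, u2, u3, u4 = rule
    left, right = u1 + u2, u3 + u4
    results = set()
    for i in range(len(w1) - len(left) + 1):
        if not w1.startswith(left, i):
            continue
        for j in range(len(w2) - len(right) + 1):
            if w2.startswith(right, j):
                results.add(w1[:i + len(u1)] + w2[j + len(u3):])
    return results

def mi(w):
    """Mirror image of a word."""
    return w[::-1]

def mirror_identity_holds(u, v, w1, w2):
    # Left side: splice with the (1,3) rule (u,1;v,1), then mirror each result.
    lhs = {mi(z) for z in splice(w1, w2, (u, "", v, ""))}
    # Right side: swap the arguments, mirror both, then splice with the
    # (2,4) rule (1, mi(v); 1, mi(u)).
    rhs = splice(mi(w2), mi(w1), ("", mi(v), "", mi(u)))
    return lhs == rhs

words = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]
markers = [w for w in words if len(w) <= 2]
print(all(mirror_identity_holds(u, v, w1, w2)
          for u in markers for v in markers
          for w1 in words for w2 in words))
```

The check passes because a site of u at position i in w1 corresponds to a site of mi(u) at position |w1| − i − |u| in mi(w1), and likewise for v.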

Using this fact, an analogue of Lemma 11 can be readily proved for the
sequence {SS_kH(2,4) | k ≥ −1}.

6 Simple Circular H Systems 

Splicing for circular strings was considered by Tom Head in [9]. In [11] the circular
splicing operation which uses a rule of the general type r = (u1, u2; u3, u4) is
defined by:

    (^xu1u2, ^yu3u4) ⊢_r ^xu1u4yu3u2

and is depicted in Figure 6.

Fig. 6. Circular splicing



We mention the following easy to prove properties of circular splicing, which 
are not shared with the linear splicing. 

Lemma 12. For every splicing rule (u1, u2; u3, u4), and every ^x, ^y, ^z ∈ V°
such that (^x, ^y) ⊢_(u1,u2;u3,u4) ^z, we have:

the length preserving property: |^x| + |^y| = |^z|,

the symbol preserving property: |^x|_b + |^y|_b = |^z|_b for every b ∈ V.
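A small Python sketch can make the circular operation and Lemma 12 concrete. It is an illustration under assumptions, not the paper's construction: a circular string is represented by its lexicographically least rotation, the empty string 1 is "", and only sites that fit within one turn of the circular word are considered. The names rotations, canon and circular_splice are hypothetical.

```python
from collections import Counter

def rotations(w):
    """All linearizations of the (nonempty) circular word ^w."""
    return {w[i:] + w[:i] for i in range(len(w))}

def canon(w):
    """Canonical representative of ^w: its least rotation."""
    return min(rotations(w))

def circular_splice(w1, w2, rule):
    """(^x u1 u2, ^y u3 u4) |- ^x u1 u4 y u3 u2, over all linearizations."""
    u1, u2, u3, u4 = rule
    results = set()
    for r1 in rotations(w1):
        if not r1.endswith(u1 + u2):
            continue
        x = r1[:len(r1) - len(u1 + u2)]
        for r2 in rotations(w2):
            if r2.endswith(u3 + u4):
                y = r2[:len(r2) - len(u3 + u4)]
                results.add(canon(x + u1 + u4 + y + u3 + u2))
    return results

def lemma12_holds(w1, w2, rule):
    """Length and symbol counts are additive for every splicing result."""
    return all(len(z) == len(w1) + len(w2)
               and Counter(z) == Counter(w1) + Counter(w2)
               for z in circular_splice(w1, w2, rule))

print(circular_splice("aac", "b", ("c", "", "b", "")))
print(lemma12_holds("abab", "aa", ("a", "", "a", "")))
```

Both properties are immediate from the definition, since the result xu1u4yu3u2 is a rearrangement of the letters of xu1u2 and yu3u4.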
For a fixed symbol a ∈ V (a marker) we can consider four types of simple
circular splicing rules: (a, 1; a, 1), (1, a; 1, a), (1, a; a, 1), (a, 1; 1, a). Their action
on circular strings is illustrated below:

    type (1,3): (^xa, ^ya) ⊢_(a,1;a,1) ^xaya,
    type (2,4): (^ax, ^ay) ⊢_(1,a;1,a) ^axay,
    type (2,3): (^ax, ^ya) ⊢_(1,a;a,1) ^axya,
    type (1,4): (^xa, ^ay) ⊢_(a,1;1,a) ^xaay.

For i = 1, 2 and j = 3, 4 an (i, j)-simple circular H system is a circular H
system S = (V, A, R), with the set of rules R consisting only of simple rules of
type (i, j) for some a ∈ V.

We denote by SH°(i, j) the class of all (i, j)-simple circular languages. Only
two out of these four classes are distinct.

Theorem 6. (See Theorems 4.1 and 4.2 of [5]) We have:

(i) SH°(1,3) = SH°(2,4).

(ii) SH°(2,3) = SH°(1,4).

(iii) The classes SH°(1,3) and SH°(2,3) are incomparable.

Several other properties which emphasize the difference between the case 
of linear simple splicing and that of circular simple splicing (for instance the 
behavior over the one-letter alphabet) have been presented in [5]. 

We have SH°(1,3) ⊆ REG° as proved in [5], and in [17] in a different context.
The relationship between SH°(1,4) and REG° is still an open problem.

We give next some closure properties of SH°(1,3).

Theorem 7. We have the following:

(i) SH°(1,3) is not closed under union.

(ii) SH°(1,3) is not closed under intersection with REG°.

(iii) SH°(1,3) is not closed under morphisms.

(iv) SH°(1,3) is not closed under inverse morphisms.

Proof. For assertion (i), take L1 = {^a^n | n ≥ 1} ∈ SH°(1,3), and L2 = {^(ab)^n |
n ≥ 1} ∈ SH°(1,3). Suppose L1 ∪ L2 ∈ SH°(1,3); then it would be closed under
some (1,3) rules. But with rule (a, 1; a, 1) we get:

    (^a^n, ^(ab)^n) ⊢ ^a^n(ab)^n ∉ L1 ∪ L2,

and then, the only simple splicing rule which could be used to generate L1 ∪ L2,
which is infinite, would be (b, 1; b, 1). But the words ^a^n cannot be generated
using this rule.
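The splicing step used for assertion (i) can be replayed for small n. This is a hedged sketch only: circular words are represented by their least rotation, the empty string 1 is "", and the helper names (rotations, canon, circular_splice, in_L1_union_L2, step_leaves_union) are illustrative, not from the paper.

```python
def rotations(w):
    """All linearizations of the (nonempty) circular word ^w."""
    return {w[i:] + w[:i] for i in range(len(w))}

def canon(w):
    """Canonical representative of ^w: its least rotation."""
    return min(rotations(w))

def circular_splice(w1, w2, rule):
    """(^x u1 u2, ^y u3 u4) |- ^x u1 u4 y u3 u2, over all linearizations."""
    u1, u2, u3, u4 = rule
    results = set()
    for r1 in rotations(w1):
        if not r1.endswith(u1 + u2):
            continue
        x = r1[:len(r1) - len(u1 + u2)]
        for r2 in rotations(w2):
            if r2.endswith(u3 + u4):
                y = r2[:len(r2) - len(u3 + u4)]
                results.add(canon(x + u1 + u4 + y + u3 + u2))
    return results

def in_L1_union_L2(w):
    """Membership of ^w in {^a^n | n >= 1} or {^(ab)^n | n >= 1}."""
    n = len(w)
    is_a_power = n >= 1 and set(w) == {"a"}
    is_ab_power = n >= 2 and n % 2 == 0 and canon(w) == canon("ab" * (n // 2))
    return is_a_power or is_ab_power

def step_leaves_union(n):
    """Rule (a,1;a,1) on (^a^n, ^(ab)^n) yields ^a^n(ab)^n, outside the union."""
    z = canon("a" * n + "ab" * n)
    produced = circular_splice("a" * n, "ab" * n, ("a", "", "a", ""))
    return z in produced and not in_L1_union_L2(z)

print([step_leaves_union(n) for n in range(1, 5)])
```

The produced word has 2n occurrences of a and n of b, so it can be neither a power of a nor a power of ab.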

For (ii), take V° ∈ SH°(1,3) and L = {^a^n b | n ≥ 1}, which is in REG° but
not in SH°(1,3). We have V° ∩ L = L ∉ SH°(1,3).

For (iii), take L = {^a^n | n ≥ 1} ∪ {^b^n | n ≥ 1} ∈ SH°(1,3). Take the
morphism h with h(a) = a and h(b) = ab. Then h(L) = {^a^n | n ≥ 1} ∪ {^(ab)^n |
n ≥ 1} ∉ SH°(1,3) (as proved in (i) above).

For (iv), take V = {a, b} and L = {^a} ∈ FIN° ⊆ SH°(1,3). Take the morphism
h defined by h(a) = a, h(b) = 1. Then h^(-1)(L) = {^b^n a | n ≥ 0} ∉ SH°(1,3). □
7 Semi-simple Circular Splicing Systems

For fixed letters a, b ∈ V, we will consider the following four types of semi-simple
circular splicing rules: (a, 1; b, 1), (1, a; 1, b), (1, a; b, 1), (a, 1; 1, b).

The action of the four rules on circular strings is depicted in Figures 7, 8, 9 
and 10. 






Fig. 7. Circular splicing using the semi-simple rule (a,1;b,1)

Fig. 8. Circular splicing using the semi-simple rule (1,a;1,b)

Fig. 9. Circular splicing using the semi-simple rule (1,a;b,1)

We will consider the semi-simple splicing rules as functions from V° × V° to
2^(V°). Using the mirror image function we can express some relationships between
the four types of rules seen as functions.




Fig. 10. Circular splicing using the semi-simple rule (a,1;1,b)

Lemma 13. The following equalities of functions from V° × V° to 2^(V°) hold:

    (a, 1; b, 1)mi = (mi × mi)(1, a; 1, b),

    (a, 1; 1, b)mi = (mi × mi)(1, a; b, 1).



Theorem 8. The classes SSH°(1,3) and SSH°(2,4) are incomparable.

Proof. Note first that simple circular splicing languages (in SH°(1,3) =
SH°(2,4)) are in the intersection SSH°(1,3) ∩ SSH°(2,4).

The semi-simple H system S1 = ({a, b, c}, {^aac, ^b}, {(c, 1; b, 1)}) generates
a language L1 = L(S1) ∈ SSH°(1,3) \ SSH°(2,4).

Note that the words of L1 have the following two properties:

(I) All words in L1, with the exception of the two axioms, have occurrences
of all three letters a, b, and c.

(II) All occurrences of a in the words of L1 are in subwords of the form aac.

Now, suppose L1 is (2,4) generated. Then it would be respected by one or
several of the rules of the forms:

(i) (1, c; 1, b), (ii) (1, b; 1, c), (iii) (1, a; 1, b), (iv) (1, b; 1, a),
(v) (1, c; 1, a), (vi) (1, a; 1, c), (vii) (1, a; 1, a), (viii) (1, b; 1, b),
(ix) (1, c; 1, c).

But we have:

(v) (^aac, ^aac) ⊢_(1,c;1,a) ^aacaca ∉ L1 (no b's),

(vi) (^aac, ^aac) ⊢_(1,a;1,c) ^aacaca ∉ L1 (no b's),

(vii) (^aac, ^aac) ⊢_(1,a;1,a) ^caacaa ∉ L1 (no b's),

(ix) (^aac, ^aac) ⊢_(1,c;1,c) ^caacaa ∉ L1 (no b's),

(viii) (^b, ^b) ⊢_(1,b;1,b) ^bb ∉ L1 (no a's, no c's),

(i) (^aac, ^b) ⊢_(1,c;1,b) ^caab ∉ L1 (contradicts (II)),

(ii) (^b, ^aac) ⊢_(1,b;1,c) ^bcaa ∉ L1 (contradicts (II)),

(iii) (^aac, ^b) ⊢_(1,a;1,b) ^acab ∉ L1 (contradicts (II)),

(iv) (^b, ^aac) ⊢_(1,b;1,a) ^baca ∉ L1 (contradicts (II)).

The semi-simple H system S2 = ({a, b, c}, {^aac, ^b}, {(1, c; 1, b)}) generates
a language L2 = L(S2) ∈ SSH°(2,4) \ SSH°(1,3). The proof that L2 is not in
SSH°(1,3) is similar to the above one. We note that the words of L2 have the
following properties:







(I) All words in L2, with the exception of the two axioms, have occurrences
of all three letters a, b, and c.

(II) All occurrences of a in the words of L2 are in subwords of the form caa.

If we suppose L2 is (1,3) generated, then it would be respected by one or
several of the rules of the forms:

(i) (c, 1; b, 1), (ii) (b, 1; c, 1), (iii) (a, 1; b, 1), (iv) (b, 1; a, 1),
(v) (c, 1; a, 1), (vi) (a, 1; c, 1), (vii) (a, 1; a, 1), (viii) (b, 1; b, 1),
(ix) (c, 1; c, 1).

But we have:

(v) (^aac, ^aac) ⊢_(c,1;a,1) ^aacaca ∉ L2 (no b's, contradicting (I)),

(vi) (^aac, ^aac) ⊢_(a,1;c,1) ^acaaac ∉ L2 (no b's),

(vii) (^aac, ^aac) ⊢_(a,1;a,1) ^caacaa ∉ L2 (no b's),

(ix) (^aac, ^aac) ⊢_(c,1;c,1) ^caacaa ∉ L2 (no b's),

(viii) (^b, ^b) ⊢_(b,1;b,1) ^bb ∉ L2 (no a's, no c's),

(i) (^aac, ^b) ⊢_(c,1;b,1) ^aacb ∉ L2 (contradicts (II)),

(ii) (^b, ^aac) ⊢_(b,1;c,1) ^baac ∉ L2 (contradicts (II)),

(iii) (^aac, ^b) ⊢_(a,1;b,1) ^acab ∉ L2 (contradicts (II)),

(iv) (^b, ^aac) ⊢_(b,1;a,1) ^baca ∉ L2 (contradicts (II)). □
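The first half of the proof of Theorem 8 lends itself to a mechanical check. The sketch below is an illustration under assumptions, not the authors' method: circular words are stored as least rotations, 1 is "", and all helper names are hypothetical. Because circular splicing adds lengths (Lemma 12), the closure restricted to words of length at most 6 is exactly the set of words of L1 of that length, so any shorter-than-7 word produced by a (2,4) rule and missing from that set witnesses that the rule does not respect L1.

```python
from itertools import product

def rotations(w):
    """All linearizations of the (nonempty) circular word ^w."""
    return {w[i:] + w[:i] for i in range(len(w))}

def canon(w):
    """Canonical representative of ^w: its least rotation."""
    return min(rotations(w))

def circular_splice(w1, w2, rule):
    """(^x u1 u2, ^y u3 u4) |- ^x u1 u4 y u3 u2, over all linearizations."""
    u1, u2, u3, u4 = rule
    results = set()
    for r1 in rotations(w1):
        if not r1.endswith(u1 + u2):
            continue
        x = r1[:len(r1) - len(u1 + u2)]
        for r2 in rotations(w2):
            if r2.endswith(u3 + u4):
                y = r2[:len(r2) - len(u3 + u4)]
                results.add(canon(x + u1 + u4 + y + u3 + u2))
    return results

def bounded_circular_language(axioms, rules, max_len):
    """Words of length <= max_len generated from the axioms (exact, by Lemma 12)."""
    lang = {canon(w) for w in axioms}
    changed = True
    while changed:
        changed = False
        for w1, w2 in product(list(lang), repeat=2):
            if len(w1) + len(w2) > max_len:
                continue
            for rule in rules:
                for z in circular_splice(w1, w2, rule):
                    if z not in lang:
                        lang.add(z)
                        changed = True
    return lang

def violations(lang, rule, max_len):
    """Short words produced by `rule` from words of `lang` but not in `lang`."""
    out = set()
    for w1, w2 in product(lang, repeat=2):
        if len(w1) + len(w2) <= max_len:
            out |= circular_splice(w1, w2, rule) - lang
    return out

L1_short = bounded_circular_language({"aac", "b"}, [("c", "", "b", "")], 6)
rules_24 = [("", p, "", q) for p, q in product("abc", repeat=2)]
witnesses = {rule: violations(L1_short, rule, 6) for rule in rules_24}
print(sorted(L1_short, key=len))
print(all(witnesses[rule] for rule in rules_24))
```

Each of the nine semi-simple (2,4) rules over {a, b, c} has at least one witness, matching the case list (i) to (ix) for L1; the symmetric check for L2 would swap the roles of the two rule types.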

8 Conclusions and Further Research 

We have introduced some particular classes of splicing languages, continuing the 
studies of [12], [8], and [10]. Our emphasis was on classes associated to types, 
different from the (1,3) type, which has been more extensively studied. We have 
also proposed extensions of these concepts to the circular case, continuing the 
research started in [5] . We have pointed out several open problems which remain 
to be investigated. 

Following the lines of the extensive study in [12], many other problems can 
be formulated and investigated for the other classes as well: closure properties, 
decidability problems, descriptional complexity questions, characterization prob- 
lems. The interesting notion of arrow graph is also worth exploring further. In 
particular, the behavior of the classes which arise from considering splicing rules 
of types (2, 3) and (1, 4) is expected to deviate from the “standard” of the (1, 3) 
and (2,4) classes. 

Acknowledgement. Part of the material in this paper was presented at 
the First Joint Meeting RSME-AMS, Sevilla, June 2003, as the conference talk 
entitled “Another Class of Semi-simple Splicing Languages” . 
Abstract. The notion of a network of Watson-Crick DOL systems was
recently introduced, [7]. It is a distributed system of deterministic lan- 
guage defining devices making use of Watson-Crick complementarity. 
The research is continued in this paper, where we establish three re- 
sults about the power of such networks. Two of them show how it is 
possible to solve in linear time two well-known NP-complete problems, 
the Hamiltonian Path Problem and the Satisfiability Problem. Here the 
characteristic feature of DNA computing, the massive parallelism, is used 
very strongly. As an illustration we use the propositional formula from 
the celebrated recent paper, [3]. The third one shows how in the very 
simple case of four-letter DNA alphabets we can obtain weird (not even 
Z-rational) patterns of population growth. 



1 Introduction 

Many mathematical models of DNA computing have been investigated, some 
of them already before the fundamental paper of Adleman, [1]. The reader is 
referred to [11,2] for details. Networks of language generating devices, in the 
sense investigated in this paper, were introduced in [5,6,7]. This paper continues 
the work begun in [7]. The underlying notion of a Watson-Crick DOL system was 
introduced and studied further in [10,8,16,17,18,20,4]. 

The reader can find background material and motivation in the cited ref- 
erences. Technically this paper is largely self-contained. Whenever need arises, 
[14,13] can be consulted in general matters about formal languages, [12] in mat- 
ters dealing with Lindenmayer systems, and [19,9] in matters dealing with formal 
power series. 

Watson-Crick complementarity is a fundamental concept in DNA computing. 
A notion, called Watson-Crick DOL system, where the paradigm of complemen- 
tarity is considered in the operational sense, was introduced in [10]. 
A Watson-Crick DOL system (a WDOL system, for short) is a DOL system over
a so-called DNA-like alphabet Σ and a mapping φ called the mapping defining
the trigger for complementarity transition. In a DNA-like alphabet each letter
has a complementary letter and this relation is symmetric. φ is a mapping from
the set of strings (words) over the DNA-like alphabet Σ to {0, 1} with the fol-
lowing property: the φ-value of the axiom is 0 and whenever the φ-value of a
string is 1, then the φ-value of its complementary string must be 0. (The com-
plementary string of a string is obtained by replacing each letter in the string
with its complementary letter.) The derivation in a Watson-Crick DOL system
proceeds as follows: when the new string has been computed by applying the
homomorphism of the DOL system, then it is checked according to the trigger.
If the φ-value of the obtained string is 0 (the string is a correct word), then
the derivation continues in the usual manner. If the obtained string is an in-
correct one, that is, its φ-value is equal to 1, then the string is changed for its
complementary string and the derivation continues from this string.

The idea behind the concept is the following: in the course of the computation
or development things can go wrong to such an extent that it is worth con-
tinuing with the complementary string, which is always available. This argument is
general and does not necessarily refer to biology. Watson-Crick complementarity
is viewed as an operation: together with or instead of a word w we consider its
complementary word.

A step further was made in [7]: networks of Watson-Crick DOL systems were
introduced. The notion was a particular variant of a general paradigm, called
networks of language processors, introduced in [6] and discussed in detail in [5].

A network of Watson-Crick DOL systems is a finite collection of Watson-Crick 
DOL systems over the same DNA-like alphabet and with the same trigger. These 
WDOL systems act on their own strings in a synchronized manner and after each 
derivation step communicate some of the obtained words to each other. The 
condition for communication is determined by the trigger for complementarity 
transition. Two variants of communication protocols were discussed in [7]. In 
the case of protocol (a), after performing a derivation step, the node keeps every 
obtained correct word and the complementary word of each obtained incorrect 
word (each corrected word) and sends a copy of each corrected word to every 
other node. In the case of protocol (b), as in the previous case, the node keeps
all the correct words and the corrected ones (the complementary strings of the 
incorrect strings) but communicates a copy of each correct string to each other 
node. The two protocols realize different strategies: in the first case, if some error 
is detected, it is corrected but a note is sent about this fact to the others. In the 
second case, the nodes inform the other nodes about their correct strings and 
keep for themselves all information which refers to the correction of some error. 

The purpose of this paper is to establish three results about the power of such 
networks, where the trigger φ is defined in a very natural manner. The results are
rather surprising because the underlying DOL systems are very simple. Two of 
them show how it is possible to solve in linear time two well-known NP-complete 
problems, the Hamiltonian Path Problem and the Satisfiability Problem. Here 
the characteristic feature of DNA computing, the massive parallelism, is used
very strongly. As an illustration we use the formula from the celebrated recent 
paper, [3]. The third result shows how in the very simple case of four-letter 
DNA alphabets we can obtain weird (not even Z-rational) patterns of population 
growth. 

2 Definitions 

By a DOL system we mean a triple H = (Σ, g, w0), where Σ is an alphabet, g is
an endomorphism defined on Σ* and w0 ∈ Σ* is the axiom. The word sequence
S(H) of H is defined as the sequence of words w0, w1, w2, ... where w_{i+1} = g(w_i)
for i ≥ 0.

In the following we recall the basic notions concerning Watson-Crick DOL
systems, introduced in [10,16].

By a DNA-like alphabet Σ we mean an alphabet with 2n letters, n ≥ 1,
of the form Σ = {a1, ..., an, ā1, ..., ān}. Letters ai and āi, 1 ≤ i ≤ n, are
said to be complementary letters; we also call the non-barred symbols purines
and the barred symbols pyrimidines. The terminology originates from the basic
DNA alphabet {A, G, C, T}, where the letters A and G are for purines and their
complementary letters T and C for pyrimidines.

We denote by h_w the letter-to-letter endomorphism of a DNA-like alphabet
Σ mapping each letter to its complementary letter. h_w is also called the Watson-
Crick morphism.

A Watson-Crick DOL system (a WDOL system, for short) is a pair W =
(H, φ), where H = (Σ, g, w0) is a DOL system with a DNA-like alphabet Σ,
morphism g and axiom w0 ∈ Σ+, and φ : Σ* → {0, 1} is a recursive function
such that φ(w0) = φ(λ) = 0 and for every word u ∈ Σ* with φ(u) = 1 it holds
that φ(h_w(u)) = 0.

The word sequence S(W) of a Watson-Crick DOL system W consists of words
w0, w1, w2, ..., where for each i ≥ 0

    w_{i+1} = g(w_i)        if φ(g(w_i)) = 0,
    w_{i+1} = h_w(g(w_i))   if φ(g(w_i)) = 1.

The condition φ(w) = 1 is said to be the trigger for complementarity transi-
tion. In the following we shall also use this term: a word w ∈ Σ* is called correct
according to φ if φ(w) = 0, and it is called incorrect otherwise. If it is clear from
the context, then we can omit the reference to φ.

An important notion concerning Watson-Crick DOL systems is the Watson-
Crick road. Let W = (H, φ) be a Watson-Crick DOL system, where H =
(Σ, g, w0). The Watson-Crick road of W is an infinite binary word α over {0, 1}
such that the ith bit of α is equal to 1 if and only if at the ith step of the computa-
tion in W a transition to the complementary takes place, that is, φ(g(w_{i-1})) = 1,
i ≥ 1, where w0, w1, w2, ... is the word sequence of W.

Obviously, various mappings φ can satisfy the conditions of defining a trigger
for complementarity transition. In the following we shall use a particular
variant and we call the corresponding Watson-Crick DOL system standard.
In this case a word w satisfies the trigger for turning to the complementary
word (it is incorrect) if it has more occurrences of pyrimidines (barred
letters) than purines (non-barred letters). Formally, consider a DNA-like
alphabet Σ = {a1, ..., an, ā1, ..., ān}, n ≥ 1. Let Σ_PUR = {a1, ..., an} and
Σ_PYR = {ā1, ..., ān}. Then, we define φ : Σ* → {0, 1} as follows: for w ∈ Σ*

    φ(w) = 0 if |w|_{Σ_PUR} ≥ |w|_{Σ_PYR}, and
    φ(w) = 1 if |w|_{Σ_PUR} < |w|_{Σ_PYR}.
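The standard trigger and the resulting word sequence are easy to simulate. The following Python sketch is illustrative only and makes an encoding assumption that is not in the paper: purines (non-barred letters) are written as uppercase letters and their complementary pyrimidines (barred letters) as the matching lowercase letters, so the Watson-Crick morphism becomes a case swap. The function names and the sample morphism g are hypothetical.

```python
def phi(word):
    """Standard trigger: 1 iff pyrimidines (lowercase) outnumber purines (uppercase)."""
    purines = sum(ch.isupper() for ch in word)
    pyrimidines = sum(ch.islower() for ch in word)
    return 1 if purines < pyrimidines else 0

def complement(word):
    """Watson-Crick morphism h_w: replace each letter by its complementary letter."""
    return word.swapcase()

def apply_morphism(g, word):
    """Apply the DOL morphism g, given as a dict from letters to words."""
    return "".join(g[ch] for ch in word)

def word_sequence(g, axiom, steps):
    """First words w0..w_steps of the standard WDOL system and the road prefix."""
    words, road = [axiom], []
    for _ in range(steps):
        image = apply_morphism(g, words[-1])
        bit = phi(image)
        # a correct image is kept; an incorrect one is replaced by its complement
        words.append(complement(image) if bit else image)
        road.append(bit)
    return words, road

# Sample system over the DNA-like alphabet {A, a} (an assumption for illustration).
g = {"A": "Aa", "a": "a"}
words, road = word_sequence(g, "A", 3)
print(words, road)
```

In this sample the image of the axiom is still correct, while the next two images contain more pyrimidines than purines and are therefore complemented.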

Following [7], we now define the basic notion in this paper, a network of 
Watson-Crick DOL systems (an NWDOL system). It is a finite collection of WDOL 
systems over the same DNA-like alphabet and with the same trigger, where the 
component WDOL systems act in a synchronized manner by rewriting their own 
sets of strings in the WDOL manner and after each derivation step communicate 
some of the obtained words to each other. The condition for communication is 
determined by the trigger for complementarity transition. 

Definition 1 By an N_rWDOL system (a network of Watson-Crick DOL sys-
tems) with r components or nodes, where r ≥ 1, we mean an (r + 2)-tuple

    Γ = (Σ, φ, (g1, {A1}), ..., (gr, {Ar})),

where

— Σ = {a1, ..., an, ā1, ..., ān}, n ≥ 1, is a DNA-like alphabet, the alphabet of
the system,

— φ : Σ* → {0, 1} is a mapping defining a trigger for complementarity transi-
tion, and

— (gi, {Ai}), 1 ≤ i ≤ r, called the ith component or the ith node of Γ, is a pair
where gi is a DOL morphism over Σ and Ai is a correct nonempty word over
Σ according to φ, called the axiom of the ith component.

If the number of the components in the network is irrelevant, then we speak
of an NWDOL system. An NWDOL system is called standard if φ is defined in
the same way as in the case of standard WDOL systems.

Definition 2 For an N_rWDOL system Γ = (Σ, φ, (g1, {A1}), ..., (gr, {Ar})),
r ≥ 1, the r-tuple (L1, ..., Lr), where Li, 1 ≤ i ≤ r, is a finite set of correct
strings over Σ according to φ, is called a state of Γ. Li, 1 ≤ i ≤ r, is called the
state or contents of the ith component. ({A1}, ..., {Ar}) is said to be the initial
state of Γ.

NWDOL systems change their states by direct derivation steps. A direct 
change of a state to another one means a rewriting step followed by communi- 
cation according to the given protocol of the system. In the following we define 
two variants of communication protocols, representing different communication 
philosophies. 
Definition 3 Let s1 = (L1, ..., Lr) and s2 = (L'1, ..., L'r) be two states of an
N_rWDOL system Γ = (Σ, φ, (g1, {A1}), ..., (gr, {Ar})), r ≥ 1.

— We say that s1 directly derives s2 by protocol (a), written as s1 ⇒_a s2, if

    L'_i = C'_i ∪ ⋃_{j=1}^{r} h_w(B'_j),

where

    C'_i = {g_i(v) | v ∈ L_i, φ(g_i(v)) = 0} and
    B'_j = {g_j(u) | u ∈ L_j, φ(g_j(u)) = 1}.

— We say that s1 directly derives s2 by protocol (b), written as s1 ⇒_b s2, if

    L'_i = h_w(B'_i) ∪ ⋃_{j=1}^{r} C'_j,

where

    B'_i = {g_i(u) | u ∈ L_i, φ(g_i(u)) = 1} and
    C'_j = {g_j(v) | v ∈ L_j, φ(g_j(v)) = 0} holds.

Thus, in the case of both protocols, after applying a derivation step in the 
WDOL manner the node keeps the correct words and the corrected words (the 
complementary words of the incorrect ones), and in the case of protocol (a) it 
sends a copy of every corrected word to each other node, while in the case of 
protocol (b) it communicates a copy of every correct word to each other node.
The two protocols realize different communication strategies: In the first case 
the nodes inform each other about the correction of the detected errors, while in 
the second case the nodes inform each other about the obtained correct words. 
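One direct derivation step under either protocol can be sketched in Python. This is a minimal illustration under assumptions that are not part of the paper: the trigger is the standard one, purines are encoded as uppercase letters and pyrimidines as the matching lowercase letters, a state is a list of sets of words, and the two sample morphisms are invented for the example.

```python
def phi(word):
    """Standard trigger: 1 iff pyrimidines (lowercase) outnumber purines (uppercase)."""
    purines = sum(ch.isupper() for ch in word)
    pyrimidines = sum(ch.islower() for ch in word)
    return 1 if purines < pyrimidines else 0

def complement(word):
    """Watson-Crick morphism h_w as a case swap."""
    return word.swapcase()

def apply_morphism(g, word):
    return "".join(g[ch] for ch in word)

def step(state, morphisms, protocol):
    """One direct derivation step of a standard NWDOL system."""
    correct, corrected = [], []
    for words, g in zip(state, morphisms):
        images = {apply_morphism(g, w) for w in words}
        correct.append({w for w in images if phi(w) == 0})                # C'_i
        corrected.append({complement(w) for w in images if phi(w) == 1})  # h_w(B'_i)
    all_correct = set().union(*correct)
    all_corrected = set().union(*corrected)
    if protocol == "a":
        # keep own correct words; corrected words of every node are shared
        return [c | all_corrected for c in correct]
    if protocol == "b":
        # keep own corrected words; correct words of every node are shared
        return [hb | all_correct for hb in corrected]
    raise ValueError("protocol must be 'a' or 'b'")

# Two sample nodes over {A, a} (assumed for illustration).
morphisms = [{"A": "Aa", "a": "a"}, {"A": "AA", "a": "a"}]
state = [{"Aa"}, {"Aa"}]
print(step(state, morphisms, "a"))
print(step(state, morphisms, "b"))
```

In the sample, node 1 produces the incorrect word Aaa (corrected to aAA) and node 2 produces the correct word AAa, so protocol (a) spreads aAA to both nodes while protocol (b) spreads AAa.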

Definition 4 Let Γ = (Σ, φ, (g1, {A1}), ..., (gr, {Ar})), for r ≥ 1, be an
N_rWDOL system with protocol (x), x ∈ {a, b}.

The state sequence S(Γ) of Γ, S(Γ) = s(0), s(1), ..., is defined as follows:
s(0) = ({A1}, ..., {Ar}) and s(t) ⇒_x s(t + 1) for t ≥ 0.

The notion of a road is extended to concern NWDOL systems and their com- 
ponents in the natural fashion. For formal details we refer to [7]. 

3 Satisfiability Problem 

It is well known that the satisfiability problem SAT of propositional formulas
is NP-complete, [13,15]. We will consider propositional formulas α in conjunc-
tive (not necessarily 3-conjunctive) normal form. Thus α is a conjunction of
disjunctions whose terms are literals, that is, variables or their negations. The
disjunctions are referred to as clauses of α. We assume that α contains v variables
x_i, 1 ≤ i ≤ v, and c clauses. The formula α is satisfiable if there is a truth-value
assignment for the variables (that is, an assignment of T or F) giving α the
value T. When we speak of computations (dealing with α) in linear time, we refer
to functions linear either in v or c.

To illustrate our arguments, we use the propositional formula β from [3]. A
DNA computer was constructed in [3] to solve the satisfiability problem of β.
The formula β, given below, is in 3-conjunctive normal form, and involves 20
variables and 24 clauses.

    (~x3 ∨ ~x16 ∨ x18) ∧ (x5 ∨ x12 ∨ ~x9)
  ∧ (~x13 ∨ ~x2 ∨ x20) ∧ (x12 ∨ ~x9 ∨ ~x5)
  ∧ (x19 ∨ ~x4 ∨ x6) ∧ (x9 ∨ x12 ∨ ~x5)
  ∧ (~x1 ∨ x4 ∨ ~x11) ∧ (x13 ∨ ~x2 ∨ ~x19)
  ∧ (x5 ∨ x17 ∨ x9) ∧ (x15 ∨ x9 ∨ ~x17)
  ∧ (~x5 ∨ ~x9 ∨ ~x12) ∧ (x6 ∨ x11 ∨ x4)
  ∧ (~x15 ∨ ~x17 ∨ x7) ∧ (~x6 ∨ x19 ∨ x13)
  ∧ (~x12 ∨ ~x9 ∨ x5) ∧ (x12 ∨ x1 ∨ x14)
  ∧ (x20 ∨ x3 ∨ x2) ∧ (x10 ∨ ~x7 ∨ ~x8)
  ∧ (~x5 ∨ x9 ∨ ~x12) ∧ (x18 ∨ ~x20 ∨ x3)
  ∧ (~x10 ∨ ~x18 ∨ ~x16) ∧ (x1 ∨ ~x11 ∨ ~x14)
  ∧ (x8 ∨ ~x7 ∨ ~x15) ∧ (~x8 ∨ x16 ∨ ~x10).

For each variable $x_i$, we introduce two auxiliary letters $t_i$ and $f_i$. Intuitively,
$t_i$ (resp. $f_i$) indicates that the value T (resp. F) is assigned to $x_i$. The letter $t_i$
(resp. $f_i$) is characteristic for the clause $C$ if the variable $x_i$ appears unnegated
(resp. negated) in $C$. Thus, the letters $f_3, f_{16}, t_{18}$ are characteristic for the first
clause in the formula $\beta$ above, whereas the letters $f_8, f_{10}, t_{16}$ are characteristic
for the last clause.

Theorem 1 The satisfiability problem can be solved in linear time by standard
NWD0L systems.

Proof. Consider a propositional formula $\alpha$ in conjunctive normal form, with
$v$ variables $x_i$, $1 \le i \le v$, and $c$ clauses $C_i$, $1 \le i \le c$. We construct a standard
$N_{2v+c+1}$WD0L system $\Gamma$ as follows. The nodes of $\Gamma$ are denoted by

$M_i,\ M'_i,\ 1 \le i \le v, \quad N_i,\ 1 \le i \le c, \quad P.$

The alphabet consists of the letters in the two alphabets

$V_1 = \{S_i \mid 0 \le i \le v-1\} \cup \{R_i \mid 1 \le i \le c\} \cup \{E, G\}$

and

$V_2 = \{t_i, f_i \mid 1 \le i \le v\},$

as well as of their barred versions.

The communication protocol (b) will be followed, that is, correct words will
be communicated. Intuitively, the letters $S_i$ are associated to the variables, the
letters $R_i$ to the clauses, and the letters $t_i, f_i$ to the truth-values. The letter $E$
is a special ending letter, and the letter $G$ a special garbage letter.

We now define the D0L productions for each node. We begin with the letters
of $V_2$. In all nodes $M_i, M'_i$, $1 \le i \le v$, as well as in the node $P$, we have the
productions

$t_j \to t_j, \quad f_j \to f_j, \quad 1 \le j \le v.$

In all nodes $N_i$, $1 \le i \le c$, we have the productions

$t_j \to t_j\bar{t}_j, \quad f_j \to f_j\bar{f}_j, \quad \bar{t}_j \to \lambda, \quad \bar{f}_j \to \lambda, \quad 1 \le j \le v,$

except that, in the case that $t_j$ or $f_j$ is characteristic for the clause $C_i$, the barred
letters are removed from the former productions, giving rise to the production
$t_j \to t_j$ or $f_j \to f_j$.

We then define the productions for the remaining letters. Each node $M_i$ (resp.
$M'_i$), $1 \le i \le v-1$, has the production $S_{i-1} \to t_iS_i$ (resp. $S_{i-1} \to f_iS_i$). The
node $M_v$ (resp. $M'_v$) has the production $S_{v-1} \to t_vR_1$ (resp. $S_{v-1} \to f_vR_1$).
Each node $N_i$, $1 \le i \le c-1$, has the production $R_i \to R_{i+1}$. The node $N_c$ has
the production $R_c \to E$. The node $P$ has the production $E \to \lambda$. Each letter $x$,
barred or nonbarred, whose production has not yet been defined in some node, 
has in this node the production x ^ O'". This completes the definition of the 
standard network $\Gamma$.

The formula $\alpha$ is satisfiable exactly in case, after $v+c+1$ computation steps, a
word $w$ over the alphabet $V_2$ appears in the node $P$. Each such word $w$ indicates
a truth-value assignment satisfying $\alpha$. Moreover, because of the communication,
each such word $w$ appears in all nodes in the next steps.

The verification of this fact is rather straightforward. There is only one 
“proper” path of computation. In the first v steps a truth-value assignment 
is created. (Actually all possible assignments are created!) In the next $c$ steps it
is checked that the assignment satisfies each of the clauses. In the final step the
auxiliary letter $E$ is then eliminated. Any deviation from the proper path causes
the letter G to be introduced. This letter can never be eliminated. We use the 
production G ^ G" instead of the simple G ^ G to avoid the unnecessary 
communication of words leading to nothing. Thus, we have completed the proof 
of our theorem. □ 
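
The step schedule of the proof (v steps that build truth-value assignments, followed by c steps that each test one clause) can be mimicked in ordinary code. The following Python sketch is only an abstraction of the network: it ignores the complementarity transitions and the communication between nodes, and the name `proper_path` as well as the encoding of the letter t_i by the integer i and of f_i by -i are assumptions made for illustration.

```python
def proper_path(clauses, n_vars):
    """Mimic the 'proper' computation path: build all assignments, then filter.

    clauses: list of tuples of signed integers (i for x_i, -i for its negation).
    Returns the surviving assignments as tuples of signed integers.
    """
    words = [()]
    # Steps 1..v: every word branches into a t_i and an f_i continuation
    # (the roles played by the nodes M_i and M'_i).
    for i in range(1, n_vars + 1):
        words = [w + (lit,) for w in words for lit in (i, -i)]
    # Steps v+1..v+c: a word survives clause C only if it contains a letter
    # that is characteristic for C (the role played by the nodes N_i).
    for clause in clauses:
        words = [w for w in words if any(lit in w for lit in clause)]
    return words


# Hypothetical toy formula: (x1 or not x2) and (x2 or x3).
survivors = proper_path([(1, -2), (2, 3)], 3)
print(survivors)
```

The sketch spends exponential space on the list of words, which is exactly the work the network distributes over the words it holds in parallel; only the number of steps, v + c, is linear.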

Coming back to our example, the required network possesses 65 nodes. The
alphabet $V_1$ has 46 and the alphabet $V_2$ 40 letters. The productions for the node
$M_1$, for instance, are

$S_0 \to t_1S_1, \quad t_j \to t_j, \quad f_j \to f_j, \quad 1 \le j \le 20,$

and X G^*’ for all other letters x. For the node N\ the productions are 

$R_1 \to R_2, \quad f_3 \to f_3, \quad f_{16} \to f_{16}, \quad t_{18} \to t_{18}, \quad x \to x\bar{x},$

for other letters $x$ in $V_2$. Furthermore, $N_1$ has the production $\bar{x} \to \lambda$, for all
letters a; in V 2 > and the production x —>■ G^° for all of the remaining letters x. 

After 46 computation steps the word

$w = f_1t_2f_3f_4f_5f_6t_7t_8f_9t_{10}t_{11}t_{12}f_{13}f_{14}t_{15}t_{16}t_{17}f_{18}f_{19}f_{20}$

appears in all nodes. This word indicates the only truth-value assignment
satisfying the propositional formula $\beta$.

The assignment and the word $w$ can be found by the following direct
argument. (The argument uses certain special properties of $\beta$. The above proof
or the construction in [3] are completely independent of such properties.) The
conjunction of the clauses involving ${\sim}x_9$ is logically equivalent to ${\sim}x_9$. This
implies that $x_9$ must assume the value F. Using this fact and the conjunction of
the clauses involving ${\sim}x_5$, we infer that $x_5$ must assume the value F. ($x_9$ and
$x_5$ are detected simply by counting the number of occurrences of each variable
in $\beta$.) From the 9th clause we now infer that $x_{17}$ must have the value T. After
this the value of each remaining variable can be uniquely determined from a
particular clause. The values of the variables can be obtained in the following
order (we indicate only the index of the variable):

9, 5, 17, 15, 7, 8, 10, 16, 18, 3, 20, 2, 13, 19, 6, 4, 11, 1, 14, 12. 
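
The uniqueness of the satisfying assignment can be cross-checked mechanically. The sketch below is a small backtracking search with unit propagation (a textbook DPLL-style procedure, not the network construction of Theorem 1). The clause list is our transcription of the 24 clauses of the example formula, with the integer i standing for x_i and -i for its negation; the function names are assumptions made for illustration.

```python
CLAUSES = [
    (-3, -16, 18), (5, 12, -9), (-13, -2, 20), (12, -9, -5),
    (19, -4, 6), (9, 12, -5), (-1, 4, -11), (13, -2, -19),
    (5, 17, 9), (15, 9, -17), (-5, -9, -12), (6, 11, 4),
    (-15, -17, 7), (-6, 19, 13), (-12, -9, 5), (12, 1, 14),
    (20, 3, 2), (10, -7, -8), (-5, 9, -12), (18, -20, 3),
    (-10, -18, -16), (1, -11, -14), (8, -7, -15), (-8, 16, -10),
]


def simplify(clauses, lit):
    """Make literal `lit` true: drop satisfied clauses, shorten the others.

    Returns None if some clause becomes empty (a contradiction).
    """
    result = []
    for clause in clauses:
        if lit in clause:
            continue
        reduced = tuple(l for l in clause if l != -lit)
        if not reduced:
            return None
        result.append(reduced)
    return result


def solutions(clauses, n_vars, assignment=()):
    """Yield every satisfying assignment as a tuple of literals sorted by index."""
    if clauses is None:
        return
    assigned = {abs(l) for l in assignment}
    if len(assigned) == n_vars:
        yield tuple(sorted(assignment, key=abs))
        return
    # A clause with a single literal forces the value of its variable.
    unit = next((c[0] for c in clauses if len(c) == 1), None)
    if unit is not None:
        yield from solutions(simplify(clauses, unit), n_vars, assignment + (unit,))
        return
    var = next(v for v in range(1, n_vars + 1) if v not in assigned)
    for lit in (var, -var):
        yield from solutions(simplify(clauses, lit), n_vars, assignment + (lit,))


found = list(solutions(CLAUSES, 20))
print(len(found), found[0])
```

Reading positive entries as t_i and negative entries as f_i, the single assignment found spells out the word w given above.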

4 Hamiltonian Path Problem 

In this section we show how another well-known NP-complete problem, namely
the Hamiltonian Path Problem (HPP), can be solved in linear time by standard
NWD0L systems. In this problem one asks whether or not a given directed graph
$\gamma = (V, E)$, where $V$ is the set of vertices or nodes of $\gamma$, and $E$ denotes the set of
its edges, contains a Hamiltonian path, that is, a path which, starting from a node
$v_{in}$ and ending at a node $v_{out}$, visits each node of the graph exactly once. Nodes
$v_{in}$ and $v_{out}$ can be chosen arbitrarily. This problem has a distinguished role in
DNA computing, since the famous experiment of Adleman in 1994 demonstrated
the solution of an instance of a Hamiltonian path problem in linear time.

Theorem 2 The Hamiltonian Path Problem can be solved in linear time by
standard NWD0L systems.

Proof. Let $\gamma = (V, E)$ be a directed graph with $n$ nodes $v_1, \ldots, v_n$, where
$n \ge 1$, and let us suppose that $v_{in} = v_1$ and $v_{out} = v_n$. We construct a standard
$N_{2n+1}$WD0L system $\Gamma$ such that any word in the language of $\Gamma$ identifies a
Hamiltonian path in $\gamma$ in a unique manner. If the language is the empty set,
then there is no Hamiltonian path in $\gamma$. Moreover, the computation process of
any word in the language $L(\Gamma)$ of $\Gamma$ ends in $2n+1$ steps.

Let us denote the nodes of $\Gamma$ by $M_1, \ldots, M_{2n+1}$, and let the alphabet $\Sigma$
of $\Gamma$ consist of letters $a_i, \$_i$, for $1 \le i \le n$, $Z_i$, with $1 \le i \le n$, $X_k$, with
$n+1 \le k \le 2n$, $Z$, $F$, $\$$, as well as of their barred versions.

We follow the communication protocol (b), that is, the copies of the correct
words are communicated.

Now we define the productions for the nodes of $\Gamma$.

For $i$ with $1 \le i \le n$, the node $M_i$ has the following D0L productions:
$a_j \to a_j$, for $1 \le j \le n$; $Z_j \to Z_{j+1}$, for $1 \le j \le n-1$; $Z \to Z_1$; $\$_k \to a_i\$_i$, if $\gamma$
has a directed edge from $v_k$ to $v_i$, for $1 \le k \le n$, $k \ne i$; $\$ \to a_1\$_1$.

Each node $M_i$, where $n+1 \le i \le 2n$, is given with the following set of D0L
productions: $\$_n \to X_i$ and $Z_n \to \lambda$, for $i = n+1$; $X_{i-1} \to X_i$, for $n+2 \le i \le 2n$;
$a_j \to a_j\bar{a}_j$, for $j \ne i-n$, where $1 \le j \le n$; $\bar{a}_j \to \lambda$, for $1 \le j \le n$; $a_j \to a_j$, for
$j = i-n$.

Finally, the node $M_{2n+1}$ has the following productions:

$X_{2n} \to \lambda, \quad a_i \to a_i, \ \text{for } 1 \le i \le n, \quad \bar{a}_i \to \lambda, \ \text{for } 1 \le i \le n.$

Furthermore, the above definition is completed by adding the production
$B \to F$, for all such letters $B$ of $\Sigma$ whose production was not specified above.

Let the axiom of the node $M_1$ be $\$Z$, and let the axiom of the other nodes
be $F$.

We now explain how $\Gamma$ simulates the procedure of looking for the Hamiltonian
paths in $\gamma$. Nodes $M_i$, $1 \le i \le n$, are responsible for simulating the
generation of paths of length $n$ in $\gamma$, nodes $M_j$, $n+1 \le j \le 2n$, are for deciding
whether each node is visited by the path exactly once, while node $M_{2n+1}$ is for
storing the result, namely the descriptions of the possible Hamiltonian paths.
The computation starts at the node $M_1$, by rewriting the axiom $\$Z$ to $a_1\$_1Z_1$.
The strings are sent to the other nodes in $\Gamma$. String $a_1\$_1Z_1$ refers to the fact
that the node $v_1$ has been visited, the first computation step has been executed
($Z_1$) and the string was forwarded to the node $M_k$. This string, after arriving
at the node $M_k$, will be rewritten to $a_1FZ_2$ if there is no direct edge in $\gamma$ from
$v_1$ to $v_k$, otherwise the string will be rewritten to $a_1a_k\$_kZ_2$, and the number of
visited nodes is increased by one ($Z_2$). Meantime, the strings arriving at nodes
$M_l$, with $n+1 \le l \le 2n+1$, will be rewritten to a string with an occurrence
of $F$, and thus will never represent a Hamiltonian path in $\gamma$. Repeating the
previous procedure, at the nodes $M_i$, for $1 \le i \le n$, after performing the first $k$
steps, where $k \le n$, we have strings of the form $ua_la_i\$_iZ_k$, where $u$ is a string
consisting of letters from $\{a_1, \ldots, a_n\}$, representing a path in $\gamma$ of length $k-2$,
and the string will be forwarded to the node $M_i$. After the $n$th step, strings of
the form $v\$_nZ_n$, where $v$ is a string of length $n$ over $\{a_1, \ldots, a_n\}$ representing
a path of length $n$ in $\gamma$, will be checked at nodes $M_j$, $n+1 \le j \le 2n$, to
find out whether or not the paths satisfy the conditions of being Hamiltonian
paths. At the step $n+1$, the string $v\$_nZ_n$ at the node $M_{n+1}$ will be rewritten
to $v'X_{n+1}$, where $X_{n+1}$ refers to the fact that the $(n+1)$st step is performed, and $v'$ is
obtained from $v$ by replacing all letters $a_j$, $1 \le j \le n$, $j \ne 1$, with $a_j\bar{a}_j$, whereas
the possible occurrences of $a_1$ are replaced by themselves. If the string does not
contain $a_1$, then the new string will not be correct, and thus its complement
contains the letter $X_{n+1}$. Since letters $X_j$ are rewritten to the trap symbol $F$,
these strings will never be rewritten to a string representing a Hamiltonian path
in $\gamma$. By the form of their productions, the other nodes will generate strings
with an occurrence of $F$ at the $(n+1)$st step. Continuing this procedure, during
the next $n-1$ steps, in the same way as above, the occurrence of the letter $a_i$,
$2 \le i \le n$, is checked at the node $M_{n+i}$. Notice that after checking the
occurrence of the letter $a_i$, the new strings will be forwarded to all nodes of $\Gamma$, so the
strings representing all possible paths in $\gamma$ are checked according to the criteria
of being Hamiltonian. After the $(2n)$th step, if the node $M_{2n}$ contains a string
$vX_{2n}$, where $v = a_{i_1} \cdots a_{i_n}$, then this string is forwarded to the node $M_{2n+1}$ and
is rewritten to $v$. It will represent a Hamiltonian path in $\gamma$, where the string of
indices $i_1, \ldots, i_n$ corresponds to the nodes $v_{i_1}, \ldots, v_{i_n}$, visited in this order. □
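
The two phases of this construction (n steps that extend walks along edges, then n steps that each check the occurrence of one vertex) can be illustrated by a conventional program. The Python sketch below is an abstraction under stated assumptions: it drops the letters $_i, Z_i, X_k and the complementarity mechanism, represents a string of letters a_i by a tuple of vertex indices, and uses a hypothetical function name and a made-up four-vertex graph.

```python
def hamiltonian_paths(n, edges):
    """Return all Hamiltonian paths from vertex 1 to vertex n.

    edges: set of pairs (k, i) meaning a directed edge from v_k to v_i.
    """
    # Generation phase: start at v_1 and extend every walk along each
    # available edge until it contains n vertices.
    walks = [(1,)]
    for _ in range(n - 1):
        walks = [w + (i,) for w in walks
                 for i in range(1, n + 1) if (w[-1], i) in edges]
    # Checking phase: one pass per vertex, keeping only the walks in which
    # that vertex occurs exactly once.
    for i in range(1, n + 1):
        walks = [w for w in walks if w.count(i) == 1]
    # The path has to end at v_out = v_n.
    return [w for w in walks if w[-1] == n]


# Hypothetical example graph on four vertices.
EDGES = {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (3, 2)}
paths = hamiltonian_paths(4, EDGES)
print(paths)
```

As in the SAT case, the list of walks may grow exponentially; the linear bound of Theorem 2 concerns the number of derivation steps, since the network handles all of its words in parallel.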



5 Population Growth in DNA Systems 



The alphabet in the systems considered above is generally bigger than the four- 
letter DNA alphabet. We now investigate the special case of DNA alphabets. 
By a DNA system, [18], we mean a Watson-Crick D0L system whose alphabet
equals the four-letter DNA alphabet. In this section we consider networks based
on DNA systems only. It turns out that quite unexpected phenomena occur in
such simple networks. For instance, the population growth can be very weird, a
function that is not even Z-rational, although the simplicity of the four-letter
nodes might suggest otherwise.

We will show in this section that it is possible to construct standard networks
of DNA systems, where the population growth is not Z-rational. (By the
population growth function $f(n)$ we mean the total number of words in all nodes at
the $n$th time instant, $n \ge 0$. We refer to [7] for formal details of the definition.)

The construction given below can be carried out for any pair (p, q) of different 
primes. We give it explicitly for the pair (2,5). The construction resembles the 
one given in [18]. 

Theorem 3 There is a network of standard DNA systems, consisting of two
components, whose population growth is not Z-rational. This holds true
independently of the protocol.

Proof. Consider the network $\Gamma$ of standard DNA systems, consisting of two
components defined as follows. The first component has the axiom $TG$ and rules

$A \to A, \quad G \to G, \quad T \to T^2, \quad C \to C^5.$

The second component has the axiom $A$ and rules

$A \to A, \quad G \to G, \quad T \to T, \quad C \to C.$

Clearly, the second component does not process its words in any way.
The beginning of the sequence of the first component is

$TG,\ A^2C,\ T^2G^5,\ T^4G^5,\ A^8C^5,\ T^8G^{25},$

$T^{16}G^{25},\ A^{32}C^{25},\ T^{32}G^{125},\ T^{64}G^{125},\ A^{128}C^{125},$

$T^{128}G^{625},\ T^{256}G^{625},\ T^{512}G^{625},\ A^{1024}C^{625}.$

The words not obtained by a complementarity transition are $TG$, $T^4G^5$,
$T^{16}G^{25}$, $T^{64}G^{125}$, $T^{256}G^{625}$ and $T^{512}G^{625}$.

The beginning of the road of the first component is 

110110110110011 ... 

Denote by $r_j$, $j \ge 1$, the $j$th bit in the road.
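
The road can be recomputed directly from the rules of the first component. The following Python sketch assumes the usual trigger of standard Watson-Crick D0L systems, namely that a word is replaced by its complement exactly when pyrimidines (T and C) strictly outnumber purines (A and G); the function names are chosen for illustration only.

```python
# Rules of the first component; a power such as T^2 is written out as "TT".
RULES = {"A": "A", "G": "G", "T": "TT", "C": "CCCCC"}
COMPLEMENT = str.maketrans("ATGC", "TACG")  # Watson-Crick pairs A-T and G-C


def step(word):
    """One derivation step; returns the new word and the road bit (1 = complemented)."""
    image = "".join(RULES[ch] for ch in word)
    pyrimidines = image.count("T") + image.count("C")
    if 2 * pyrimidines > len(image):  # assumed trigger: pyrimidines in strict majority
        return image.translate(COMPLEMENT), 1
    return image, 0


def road(axiom, steps):
    """Return the first `steps` bits of the road started from `axiom`."""
    bits, word = [], axiom
    for _ in range(steps):
        word, bit = step(word)
        bits.append(bit)
    return bits


road_bits = "".join(str(b) for b in road("TG", 15))
print(road_bits)
```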

Consider the increasing sequence of numbers, consisting of the positive pow- 
ers of 2 and 5: 

2, 4, 5, 8, 16, 25, 32, 64, 125, 128, 256, 512, 625, 1024, . . . 

For $j \ge 1$, denote by $m_j$ the $j$th number in this sequence.

The next two lemmas are rather immediate consequences of the definitions
of the sequences $r_j$ and $m_j$.

Lemma 1 The road of the first component begins with 11 and consists of
occurrences of 11 separated by an occurrence of 0 or an occurrence of 00.

The distribution of the occurrences of 0 and 00 depends on the sequence $m_j$:

Lemma 2 For each $j \ge 3$, $r_j = r_{j+1} = 0$ exactly in case $m_j = 2m_{j-1} = 4m_{j-2}$.

The next lemma is our most important technical tool. 

Lemma 3 The bits rj in the road of the first component do not constitute an 
ultimately periodic sequence. 

Proof of Lemma 3. Assume the contrary. By Lemma 2 we infer the existence of
two integers $a$ (initial mess) and $p$ (period) such that, for any $i \ge 0$, the number
of powers of 2 between $5^{a+ip}$ and $5^{a+(i+1)p}$ is constant, say $q$. Let $2^b$ be the
smallest power of 2 greater than $5^a$. Thus,

$5^a < 2^b < 2^{b+1} < \cdots < 2^{b+q-1} < 5^{a+p} < 2^{b+q}$

and, for all $i \ge 1$,

$2^{b+iq-1} < 5^{a+ip} < 2^{b+iq}.$

Denoting $\alpha = \log_2 5$, we obtain

$b + iq - 1 < \alpha(a + ip) < b + iq$

and further

$q + \frac{b-1}{i} < p\alpha + \frac{a\alpha}{i} < q + \frac{b}{i}.$

Considering large enough $i$, we infer $q = p\alpha$, which is impossible. This
contradiction proves Lemma 3.
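
The impossibility invoked in the last step can be spelled out. Taking the constant of the proof to be the base-2 logarithm of 5 (the value for which the displayed inequalities between powers of 2 and powers of 5 turn into linear ones), the equality of q with p times that constant would force a power of 2 to coincide with a power of 5:

```latex
\[
  q = p \log_2 5
  \quad\Longrightarrow\quad
  2^{q} = 5^{p},
\]
% The period satisfies p >= 1, so 5^p is an odd integer greater than 1,
% whereas 2^q is either 1 (for q = 0) or even (for q >= 1).
\[
  p \ge 1 \;\Longrightarrow\; 5^{p} \text{ is odd and } 5^{p} > 1,
  \qquad
  2^{q} \in \{1\} \cup 2\mathbb{Z}
  \quad\Longrightarrow\quad 2^{q} \ne 5^{p}.
\]
```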

We now return to the main proof of Theorem 3. We assume that the
communication protocol is (a), where corrected words are communicated. The proof
in the other case is a similar application of Lemma 3 and is left to the reader.

Denote the population growth function by $f(i)$, $i \ge 0$. Thus, the first few
numbers in the population growth sequence are:

2, 3, 4, 4, 5, 6, 6, 7, 8, 8, 9, 10, 10, 10, 11, ...
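
These numbers can be reproduced by simulating the two-component network. The Python sketch below rests on several assumptions: each node holds a set of words, a word is complemented exactly when pyrimidines strictly outnumber purines, and a copy of every corrected word is sent to every other node while the sender also keeps it. All names are chosen for illustration.

```python
COMPLEMENT = str.maketrans("ATGC", "TACG")  # Watson-Crick pairs A-T and G-C

FIRST = {"A": "A", "G": "G", "T": "TT", "C": "CCCCC"}
SECOND = {"A": "A", "G": "G", "T": "T", "C": "C"}


def rewrite(word, rules):
    """Apply the rules once; return the resulting word and whether it was corrected."""
    image = "".join(rules[ch] for ch in word)
    pyrimidines = image.count("T") + image.count("C")
    if 2 * pyrimidines > len(image):  # assumed trigger for complementation
        return image.translate(COMPLEMENT), True
    return image, False


def population_growth(nodes, steps):
    """nodes: list of (rules, axiom) pairs. Returns [f(0), f(1), ..., f(steps)]."""
    contents = [{axiom} for _, axiom in nodes]
    sizes = [sum(len(c) for c in contents)]
    for _ in range(steps):
        new = [set() for _ in nodes]
        for k, (rules, _) in enumerate(nodes):
            for word in contents[k]:
                image, corrected = rewrite(word, rules)
                new[k].add(image)  # the node keeps its own word
                if corrected:      # corrected words are copied to all other nodes
                    for other in range(len(nodes)):
                        if other != k:
                            new[other].add(image)
        contents = new
        sizes.append(sum(len(c) for c in contents))
    return sizes


growth = population_growth([(FIRST, "TG"), (SECOND, "A")], 14)
print(growth)
```

In this run the population grows by one precisely at the steps where the first component performs a complementarity transition, so the successive differences of the sequence reproduce the bits of the road.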

Assume now that $f(i)$ is Z-rational. Then also the function

$g(i) = f(i) - f(i-1), \quad i \ge 1,$

is Z-rational. But clearly $g(i) = 0$ exactly in case $r_i = 0$. Consequently, by the
Skolem-Mahler-Lech Theorem (see [19], Lemma 9.10) 0's occur in the sequence
$r_i$ in an ultimately periodic fashion. But this contradicts Lemma 3 and, thus,
Theorem 3 follows. □

Abstract. We show that the maximal set commuting with a given
regular set - its centralizer - can be defined as the maximal fixed point of
a certain language operator. Unfortunately, however, an infinite number
of iterations might be needed even in the case of finite languages.



1 Introduction 

The commutation of two elements in an algebra is among the most natural 
operations. In the case of free semigroups, i.e., words, it is easy and completely 
understood: two words commute if and only if they are powers of a common 
word, see, e.g., [11]. For the monoid of languages, even for finite languages the 
situation changes drastically. Many natural problems are poorly understood and 
likely to be very difficult. For further details we refer in general to [3], [9] or [6] 
and in connection to complexity issues to [10] and [4]. 
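
The characterization of commuting words quoted above is easy to test exhaustively on short words. In the Python sketch below, two words are powers of a common word exactly when they have the same primitive root; the helper name `primitive_root` and the length bound 5 are choices made for illustration.

```python
from itertools import product


def primitive_root(w):
    """Return the shortest word r such that w is a power of r (w nonempty)."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]


# All nonempty words of length at most 5 over the alphabet {a, b}.
WORDS = ["".join(p) for n in range(1, 6) for p in product("ab", repeat=n)]

# uv = vu holds exactly when u and v share their primitive root.
agree = all((u + v == v + u) == (primitive_root(u) == primitive_root(v))
            for u in WORDS for v in WORDS)
print(agree)
```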

Commutation of languages $X$ and $Y$ means that the equality $XY = YX$
holds. It is an equality on sets; however, to verify it one typically has to go to
the level of words. More precisely, for each $x \in X$ and $y \in Y$ one has to find
$x' \in X$ and $y' \in Y$ such that $xy = y'x'$. In a very simple setting this can lead to
nontrivial considerations. An illustrative (and simple) example is a proof that
for a two-element set $X = \{x, y\}$ with $xy \ne yx$, the maximal set commuting
with $X$ is $X^+$, see [1].

One can also use the above setting to define a computation. Given languages
$X$ and $Y$, for a word $x \in X$ define the rewriting rule

$x \Rightarrow_c x'$ if there exist $x' \in X$, $y, y' \in Y$ such that $xy = y'x'$.

Let $\Rightarrow_c^*$ be the transitive and reflexive closure of $\Rightarrow_c$. What can we say about
this relation? Very little seems to be known. Natural unanswered (according to
our knowledge) questions are: when is the closure of a word $x \in X$, i.e., its orbit,
finite or is it always recursive for given regular languages $X$ and $Y$?

* Supported by the Academy of Finland under grant 44087

We do not claim that the above operation is biologically motivated. However,
it looks to us that it resembles some of the natural operations on DNA sequences,
see [13]: the result is obtained by matching two words and then factorizing the
result differently. Consequently, it provides a further illustration of how
computationally complicated the operations based on matching of words are.

Our goal is to consider a particular question on commutation of languages 
without any biological or other motivation. More precisely, we want to intro- 
duce an algebraic approach, so-called fixed point approach, to study Conway’s 
Problem. The problem asks whether or not the maximal language commuting 
with a given regular language X is regular as well. The maximal set is called the 
centralizer of X. An affirmative answer is known only in very special cases, see, 
e.g., [15], [3], [14], [7] and [8]. In general, the problem seems to be very poorly 
understood - it is not even known whether the centralizer of a finite language 
X is recursive! 

We show that the centralizer of any language is the largest fixed point of 
a very natural language operator. Consequently, it is obtained as the limit of 
a simple recursion. When started from a regular $X$ all the intermediate
approximations are regular, as well. However, as we show by an example, an infinite
number of iterations might be needed and hence Conway's Problem remains
unanswered.

One consequence of our results is that if Conway’s Problem has an affirmative 
answer, even nonconstructively, then actually the membership problem for the 
centralizer of a regular language is decidable, i.e., it is recursive. 

2 Preliminaries 

We shall need only very basic notations of words and languages; for words see [12] 
or [2] and for languages [17] or [5]. 

Mainly to fix the terminology we specify the following. The free semigroup 
generated by a finite alphabet A is denoted by A+. Elements of A+ are called 
words and subsets of A+ are called languages. These are denoted by lower case 
letters $x, y, \ldots$ and capital letters $X, Y, \ldots$, respectively. Besides standard
operations on words and languages we especially need the operations of the quotients.
We say that a word $v$ is a left quotient of a word $w$ if there exists a word $u$ such
that $w = uv$, and we write $v = u^{-1}w$. Consequently, the operation $(u, v) \mapsto u^{-1}v$
is a partial mapping. Similarly we define right quotients, and extend both of these
to languages in a standard way: $X^{-1}Y = \{x^{-1}y \mid x \in X, y \in Y\}$.

We say that two languages $X$ and $Y$ commute if they satisfy the equality
$XY = YX$. Given an $X \subseteq A^+$ it is straightforward to see that there exists a
unique maximal set $C(X)$ commuting with $X$. Indeed, $C(X)$ is the union of all
sets commuting with $X$. It is also easy to see that $C(X)$ is a subsemigroup of
$A^+$. Moreover, we have simple approximations, see [3]:

Lemma 1. For any $X \subseteq A^+$ we have $X^+ \subseteq C(X) \subseteq \mathrm{Pref}(X^+) \cap \mathrm{Suf}(X^+)$.

Here $\mathrm{Pref}(X^+)$ (resp. $\mathrm{Suf}(X^+)$) stands for all nonempty prefixes (resp.
suffixes) of $X^+$.

Now we can state: 

Conway’s Problem. Is the centralizer of a regular X regular as well? 

Although the answer is believed to be affirmative, it is known only in
very special cases, namely when $X$ is a prefix set, binary or ternary, see [15], [3]
or [7], respectively. This, together with the fact that we do not know whether
the centralizer of a finite set is even recursive, can be viewed as evidence of the
amazingly intriguing nature of the problem of commutation of languages.

Example 1. (from [3]) Consider $X = \{a, ab, ba, bb\}$. Then, as can be readily seen,
the centralizer $C(X)$ equals $A^+ \setminus \{b\} = (X \cup \{bab, bbb\})^+$. Hence, the centralizer
is finitely generated but equals neither $X^+$ nor $\{a, b\}^+$.
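
The claims of this example can be checked on words of bounded length. The Python sketch below cuts all languages off at length N = 8 (an arbitrary bound) and verifies, for that finite portion only, that the set of all words except b commutes with X, that it is generated by X together with bab and bbb, and that it differs from the semigroup generated by X alone. It does not establish maximality; the helper names are assumptions made for illustration.

```python
from itertools import product

N = 8  # length bound of the experiment
X = {"a", "ab", "ba", "bb"}
ALL = {"".join(p) for n in range(1, N + 1) for p in product("ab", repeat=n)}
C = ALL - {"b"}  # the claimed centralizer, cut off at length N


def generated(gens, max_len):
    """Return the words of length <= max_len in the semigroup generated by gens."""
    found, frontier = set(gens), set(gens)
    while frontier:
        frontier = {w + g for w in frontier for g in gens
                    if len(w + g) <= max_len} - found
        found |= frontier
    return found


# Both products are exact up to length N, since their factors from C are shorter than N.
XC = {x + c for x in X for c in C if len(x + c) <= N}
CX = {c + x for x in X for c in C if len(c + x) <= N}

print(XC == CX, generated(X | {"bab", "bbb"}, N) == C, "bab" in generated(X, N))
```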

Finally, we note that in the above the centralizers were defined with respect
to the semigroup $A^+$. Similar theory can be developed over the free monoid $A^*$.



3 Fixed Point Approach 

As discussed extensively in [14] and [8], there have been a number of different
approaches to solving Conway's Problem. Here we introduce one more, namely
the so-called fixed point approach. It is mathematically quite elegant, although at
the moment it does not yield breakthrough results. However, it can be seen
as further evidence of the challenging nature of the problem.

Let $X \subseteq A^+$ be an arbitrary language. We define recursively

$X_0 = \mathrm{Pref}(X^+) \cap \mathrm{Suf}(X^+)$, and

$X_{i+1} = X_i \setminus [X^{-1}(XX_i \,\Delta\, X_iX) \cup (XX_i \,\Delta\, X_iX)X^{-1}]$, for $i \ge 0$, (1)

where $\Delta$ denotes the symmetric difference of languages. Finally we set

$Z_0 = \bigcap_{i \ge 0} X_i$. (2)



We shall prove 

Theorem 1. $Z_0$ is the centralizer of $X$, i.e., $Z_0 = C(X)$.

Proof. The result follows directly from the following three facts:

(i) $X_{i+1} \subseteq X_i$ for all $i \ge 0$,

(ii) $C(X) \subseteq X_i$ for all $i \ge 0$, and

(iii) $Z_0X = XZ_0$.

Indeed, (iii) implies that $Z_0 \subseteq C(X)$, while (ii) together with (2) implies that
$C(X) \subseteq Z_0$.

Claims (i)-(iii) are proved as follows. Claim (i) is obvious. Claim (ii) is proved
by induction on $i$. The case $i = 0$ is clear, by Lemma 1. Let $z \in C(X)$ and assume
that $C(X) \subseteq X_i$. Assume that $z \notin X_{i+1}$. Then

$z \in X^{-1}(XX_i \,\Delta\, X_iX) \cup (XX_i \,\Delta\, X_iX)X^{-1}.$

Consequently, there exists an $x \in X$ such that

$xz$ or $zx \in (XX_i \,\Delta\, X_iX).$

This, however, is impossible since $z \in C(X) \subseteq X_i$ and $C(X)X = XC(X)$. For
example, $xz$ is clearly in $XX_i$, but also in $X_iX$ due to the identity $xz = z'x'$
with $z' \in C(X)$, $x' \in X$. So $z$ must be in $X_{i+1}$, and hence (ii) is proved.

It remains to prove the condition (iii). If $Z_0X$ and $XZ_0$ were unequal, then
there would exist a word $w \in Z_0$ such that either $wX \not\subseteq XZ_0$ or $Xw \not\subseteq Z_0X$.
By symmetry, we may assume the former case. By the definition of $Z_0$ and (i)
this would mean that beginning from some index $k$ we would have $wX \not\subseteq XX_i$,
when $i \ge k$. However, $w \in Z_0 \subseteq X_i$ for every $i \ge 0$, especially for $k$, and
hence $X_kX \ne XX_k$. This would mean that $w \in (XX_k \,\Delta\, X_kX)X^{-1}$ and hence
$w \notin X_{k+1}$, and consequently $w \notin Z_0$, a contradiction.

Theorem 1 deserves a few remarks. 

First we define the language operator $\varphi$ by the formula

$\varphi : Y \mapsto Y \setminus [X^{-1}(XY \,\Delta\, YX) \cup (XY \,\Delta\, YX)X^{-1}],$

where $X$ is a fixed language. Then obviously all languages commuting with $X$
are fixed points of $\varphi$, and the centralizer is the maximal one. Second, in the
construction of Theorem 1 it is not important to start from the chosen $X_0$.
Any superset of $C(X)$ would work, in particular $A^+$. Third, as can be seen by
analyzing the proof, in formula (1) we could drop one of the members of the
union. However, the presented symmetric variant looks more natural.

In the next section we give an example showing that for some languages $X$
an infinite number of iterations are needed in order to get the centralizer. In the
final concluding section we draw some consequences of this result.

4 An Example 

As an example of the case in which the fixed point approach leads to an infinite
iteration we discuss the language $X = \{a, bb, aba, bab, bbb\}$. First we prove that
the centralizer of this language is $X^+$. To do this we start by proving the following
two lemmata. We consider the prefix order of $A^*$, and say that two words are
incomparable if they are so with respect to this order.

Lemma 2. Let $X$ be a rational language including a word $v$ incomparable with
other words in $X$. If $w \in C(X)$, then for some integer $n \in \{0, 1, 2, \ldots\}$ there
exist words $t \in X^n$ and $u \in \mathrm{Suf}(X)$ such that $w = ut$ and $uX^nX^* \subseteq C(X)$.

Proof. If $w \in C(X)$ and $v$ is an incomparable element in $X$, then equation
$XC(X) = C(X)X$ implies that $vw \in C(X)X$ and therefore $vwv_1^{-1} \in C(X)$ for
some element $v_1 \in X$. Repeating the argument $n$ times we obtain

$v^n w (v_n \cdots v_2v_1)^{-1} \in C(X), \quad v_i \in X,$

where $t = v_n \cdots v_2v_1$ and $w = ut$. Then $v^nu \in C(X)$ for some integer $n \in
\{0, 1, 2, \ldots\}$ and word $u \in \mathrm{Suf}(X) \cap \mathrm{Pref}(w)$. Since $v$ is incomparable, we conclude
that for every $s \in X^n$

$v^nus \in C(X)X^n = X^nC(X),$

and hence

$us \in C(X).$

In other words, $uX^n \subseteq C(X)$. Since $C(X)$ is a semigroup, we have also the
inclusion $uX^nX^* \subseteq C(X)$.

For every proper suffix $u_i \in \mathrm{Suf}(X)$, including the empty word 1, there
either exists a minimal integer $n_i$, for which $u_iX^{n_i} \subseteq C(X)$, or $u_iX^n \not\subseteq C(X)$ for
every integer $n \ge 0$. Since Lemma 2 excludes the latter case, we can associate
with every word $w \in C(X)$ a word $u_i \in \mathrm{Suf}(X)$ and the minimal $n_i$ such that
$w \in u_iX^{n_i}X^*$.

Lemma 3. If the finite language $X$ contains an incomparable word, it has a
rational centralizer. Moreover, the centralizer is finitely generated.

Proof. If the language $X$ is finite, then the set of proper suffixes of $X$ is also
finite. With the above terminology we can write

$C(X) = \bigcup_{i \in I} u_iX^{n_i}X^* = \underbrace{\Bigl(\bigcup_{i \in I} u_iX^{n_i}\Bigr)}_{=G} X^* = GX^*,$

where $I$ is an index set defining suffixes $u_i$ above. Here the language $G$ is finite
and $X \subseteq G$. Indeed if $u_0 = 1$, then $n_0 = 1$, and hence $u_0X^{n_0} = 1 \cdot X = X \subseteq G$.
Since $C(X)$ is a semigroup and $X$ is included in $G$, we obtain

$C(X) = C(X)^+ = (GX^*)^+ = (X + G)^+ = G^+.$

Now we can prove that the centralizer of our language $X = \{a, bb, aba, bab,
bbb\}$ is $X^+$. The word $bab$ is incomparable. The set of proper suffixes of $X$ is
$\{1, a, b, ab, ba, bb\}$. We will consider all of these words separately:

$u_0 = 1$: $1 \cdot X \subseteq C(X)$ so that $n_0 = 1$.

$u_1 = a$: $a \in X \subseteq C(X)$ so that $n_1 = 0$.

$u_2 = b$: $b \cdot a^n \cdot a \notin XC(X) = C(X)X$ and therefore $b \cdot a^n \notin C(X)$ for all $n \in \mathbb{N}$.
This means that the number $n_2$ does not exist.

$u_3 = ab$: $a \cdot ab \cdot (bab)^n \notin \mathrm{Suf}(X^+)$ implies $aab(bab)^n \notin C(X)$, so that
$aab(bab)^n \notin XC(X)$ and therefore $ab(bab)^n \notin C(X)$ for all $n \in \mathbb{N}$.
Hence the number $n_3$ does not exist.

$u_4 = ba$: $ba \cdot a^n \cdot a \notin XC(X)$ and therefore $ba \cdot a^n \notin C(X)$ for all $n \in \mathbb{N}$,
and hence the number $n_4$ does not exist.

$u_5 = bb$: $bb \in X \subseteq C(X)$ so that $n_5 = 0$.

As a conclusion $I = \{0, 1, 5\}$, and $G = \bigcup_{i \in I} u_iX^{n_i} = 1 \cdot X + a + bb = X$. This
gives us the centralizer

$C(X) = GX^* = XX^* = X^+,$

in other words we have established:

Fact 1 $C(\{a, bb, aba, bab, bbb\}) = \{a, bb, aba, bab, bbb\}^+$.

Next we prove that the fixed point approach applied to the language $X$ leads
to an infinite loop of iterations. We prove this by showing that there exist words
in $X_i \setminus C(X)$ for every $X_i$ of the iteration (1). To do this we take a closer look at
the language $L = (bab)^*ab(bab)^*$. Clearly $L \subseteq X_0 = \mathrm{Pref}(X^+) \cap \mathrm{Suf}(X^+)$ and
$L \cap X^+ = \emptyset$.

By the definition of the fixed point approach, a word $w \in X_i$ is in $X_{i+1}$ if and
only if $Xw \subseteq X_iX$ and $wX \subseteq XX_i$. We will check this condition for an arbitrary
word $(bab)^kab(bab)^n \in L$ with $k, n \ge 1$. The first condition $Xw \subseteq X_iX$ leads to
the cases:

$$
\begin{aligned}
a \cdot (bab)^kab(bab)^n &= (aba)(bb \cdot a)^{k-1}(bab)^{n+1} \in X^+X \subseteq X_iX, \\
bb \cdot (bab)^kab(bab)^n &= (bbb)a(bb \cdot a)^{k-1}(bab)^{n+1} \in X^+X \subseteq X_iX, \\
aba \cdot (bab)^kab(bab)^n &= a(bab)a(bb \cdot a)^{k-1}(bab)^{n+1} \in X^+X \subseteq X_iX, \\
bbb \cdot (bab)^kab(bab)^n &= (bb)^2a(bb \cdot a)^{k-1}(bab)^{n+1} \in X^+X \subseteq X_iX
\end{aligned}
\qquad (3)
$$

and

$bab \cdot (bab)^kab(bab)^n = (bab)^{k+1}ab(bab)^{n-1} \cdot bab \in X_iX.$

However, the last one holds if and only if

$(bab)^{k+1}ab(bab)^{n-1} \in X_i. \qquad (4)$

Similarly, the second condition $wX \subseteq XX_i$ yields:

$$
\begin{aligned}
(bab)^kab(bab)^n \cdot a &= bab(bab)^{k-1}(a \cdot bb)^n aba \in XX^+ \subseteq XX_i, \\
(bab)^kab(bab)^n \cdot bb &= bab(bab)^{k-1}(a \cdot bb)^n a \cdot bbb \in XX^+ \subseteq XX_i, \\
(bab)^kab(bab)^n \cdot aba &= bab(bab)^{k-1}(a \cdot bb)^n a \cdot bab \cdot a \in XX^+ \subseteq XX_i, \\
(bab)^kab(bab)^n \cdot bbb &= bab(bab)^{k-1}(a \cdot bb)^n a \cdot bb \cdot bb \in XX^+ \subseteq XX_i
\end{aligned}
\qquad (5)
$$

and

$(bab)^kab(bab)^n \cdot bab = bab \cdot (bab)^{k-1}ab(bab)^{n+1} \in XX_i.$

Here the last one holds if and only if

$(bab)^{k-1}ab(bab)^{n+1} \in X_i. \qquad (6)$

From (4) and (6) we obtain the equivalence

$(bab)^kab(bab)^n \in X_{i+1} \iff (bab)^{k+1}ab(bab)^{n-1},\ (bab)^{k-1}ab(bab)^{n+1} \in X_i. \qquad (7)$

Now, the result follows by induction, when we cover the cases $k = 0$ or $n = 0$.

In the case $k = 0$ and $n \ge 0$ we have that $ab(bab)^n \in X_0$, but $ab(bab)^n \notin X_1$,
since $a \cdot ab(bab)^n \notin X_0X$. The same applies also for $(bab)^nba$ by symmetry.



                                     ab
                           ab(bab)      (bab)ab
                  ab(bab)^2    (bab)ab(bab)    (bab)^2ab
         ab(bab)^3   (bab)ab(bab)^2   (bab)^2ab(bab)   (bab)^3ab
ab(bab)^4   (bab)ab(bab)^3   (bab)^2ab(bab)^2   (bab)^3ab(bab)   (bab)^4ab

Fig. 1. Language (bab)*ab(bab)* written as a pyramid.

[Figure 2, image not reproduced; its panels include X_0 → X_1 and X_1 → X_2.]

Fig. 2. Deleting words in (bab)*ab(bab)* from languages X_i during the iteration.



In the case $n = 0$ and $k \ge 1$ we first note that $X(bab)^kab \subseteq X_0X$, due to the
equations in (3) and the fact that $bab \cdot (bab)^kab = (bab)^kba \cdot bab \in X_0X$. Similarly,
$(bab)^kabX \subseteq XX_0$, due to the equations in (5) and the fact $(bab)^kab \cdot bab =
bab \cdot (bab)^{k-1}ab(bab) \in XX_0$. These together imply that $(bab)^kab \in X_1$. On
the other hand, since $(bab)^kba \notin X_1$, then $bab \cdot (bab)^kab \notin X_1X$, and hence
$(bab)^kab \notin X_2$.

Overall we obtain the following result: if $i = \min\{k, n+1\}$, then
$(bab)^kab(bab)^n \in X_i$ but $(bab)^kab(bab)^n \notin X_{i+1}$.

The above can be illustrated as follows. If the language $(bab)^*ab(bab)^*$ is
written in the form of a downwards infinite pyramid as shown in Figure 1, then
Figure 2 shows how the fixed point approach deletes parts of this language during
the iterations. In the first step $X_0 \to X_1$ only the words in $ab(bab)^*$ are deleted, as
drawn in the leftmost figure. The step $X_1 \to X_2$ deletes words in $(bab)ab(bab)^*$
and $(bab)^*ab$, and so on. On the step $X_i \to X_{i+1}$ the operation always deletes
the remaining words in $(bab)^iab(bab)^*$ and $(bab)^*ab(bab)^{i-1}$, but it never manages
to delete the whole language $(bab)^*ab(bab)^*$. This leads to an infinite chain of
steps as shown in the following

Fact 2 $X_0 \supset X_1 \supset \cdots \supset X_i \supset \cdots \supset C(X)$.

When computing the approximations $X_i$ the computer and available software
packages are essential. We used Grail+, see [18]. For languages $X$ and $X^+$ their
minimal automata are shown in Figure 3.
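
Independently of Grail+, the first approximations can be explored by brute force on words of bounded length. The Python sketch below is an assumption-laden illustration, not the authors' computation: it enumerates X_0 up to length 9 and applies the membership test stated above (a word w of X_i belongs to X_{i+1} exactly when Xw lies in X_iX and wX lies in XX_i). Because that test inspects words up to two letters longer than w, each refinement is exact only up to a length bound that shrinks by 2 per step.

```python
from itertools import product

X = ("a", "bb", "aba", "bab", "bbb")


def in_pref_plus(w, gens):
    """True if the nonempty word w is a prefix of some word in gens^+."""
    star = [True] + [False] * len(w)  # star[j]: w[:j] is a product of generators
    for j in range(1, len(w) + 1):
        star[j] = any(w[:j].endswith(g) and star[j - len(g)] for g in gens)
    # w = (product of generators) followed by a nonempty prefix of a generator
    return any(star[j] and any(g.startswith(w[j:]) for g in gens)
               for j in range(len(w)))


def x0(max_len, gens=X, alphabet="ab"):
    """Words of length <= max_len in Pref(X^+) intersected with Suf(X^+)."""
    reversed_gens = tuple(g[::-1] for g in gens)
    return {w for n in range(1, max_len + 1)
            for w in map("".join, product(alphabet, repeat=n))
            if in_pref_plus(w, gens) and in_pref_plus(w[::-1], reversed_gens)}


def refine(xi, max_len, gens=X):
    """One step of recursion (1), exact up to max_len if xi is exact up to max_len + 2."""
    def left_ok(w):   # every x w factors as (word of X_i)(word of X)
        return all(any((x + w).endswith(y) and (x + w)[:-len(y)] in xi for y in gens)
                   for x in gens)

    def right_ok(w):  # every w x factors as (word of X)(word of X_i)
        return all(any((w + x).startswith(y) and (w + x)[len(y):] in xi for y in gens)
                   for x in gens)

    return {w for w in xi if len(w) <= max_len and left_ok(w) and right_ok(w)}


X0 = x0(9)
X1 = refine(X0, 7)
X2 = refine(X1, 5)
print("ab" in X0, "ab" in X1, "babab" in X1, "babab" in X2)
```

In this bounded experiment the word ab is removed in the first step and babab in the second, in line with the pyramid picture of Figures 1 and 2.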




Fig. 3. Finite automata recognizing languages X and X^+



Let us consider the minimal automata we obtain in the iteration steps of
the procedure, and try to find some common patterns in those. The automaton
recognizing the starting language $X_0$ is given in Figure 4.
The numbers of states, final states and transitions for the first few steps of the
iteration are given in Table 1. From this table we can see that after a few steps
the growth becomes constant. Every step adds six more states, three of those
being final, and eleven transitions.




Fig. 4. Finite automaton recognizing the language X_0



When we draw the automata corresponding to subsequent steps from $X_i$ to
$X_{i+1}$, as in Figures 5 and 6, we see a clear pattern in the growth. See that
automata representing languages X^ and Xq are built from two sequence of sets 
of three states. In the automaton of Xq both sequences have got an additional 
set of three states. So there are totally six new states, including three new final 
ones, and eleven new transitions. In every iteration step the automata seem 
to use the same pattern to grow. When the number of iteration steps goes to 
infinity, the lengths of both sequences go also to infinity. Then the corresponding 
states can be merged together and the result will be the automaton recognizing 
the language X^ . This seems to be a general phenomena in the cases where 
an infinite number of iterations are needed. Intuitively that would solve the 
Conway’s Problem. However, we do not know how to prove it. 

5 Conclusions 

Our main theorem has a few consequences. As theoretical ones we state the
following two. These are based on the fact that formula (1) in the recursion is
very simple. Indeed, if X is regular, so are all the approximations X_i.
Similarly, if X is recursive, so are all these approximations. Of course, this
does not imply that the limit, that is the centralizer, is also regular or
recursive.

Step  Final states  States  Transitions      Step  Final states  States  Transitions
   0             6       8           15        20            60     121          222
   1             5       9           17        21            63     127          233
   2             6      13           24        22            66     133          244
   3             9      19           35        23            69     139          255
   4            12      25           46        24            72     145          266
   5            15      31           57        25            75     151          277
   6            18      37           68        26            78     157          288
   7            21      43           79        27            81     163          299
   8            24      49           90        28            84     169          310
   9            27      55          101        29            87     175          321
  10            30      61          112        30            90     181          332
  11            33      67          123        31            93     187          343
  12            36      73          134        32            96     193          354
  13            39      79          145        33            99     199          365
  14            42      85          156        34           102     205          376
  15            45      91          167        35           105     211          387
  16            48      97          178        36           108     217          398
  17            51     103          189        37           111     223          409
  18            54     109          200        38           114     229          420
  19            57     115          211        39           117     235          431

Table 1. The numbers of final states, states and transitions of the automata
corresponding to the iteration steps for the language X.
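
The constant growth visible in Table 1 can be checked mechanically. The closed
form below is our own fit to the rows with step number i ≥ 2 (three final
states, six states and eleven transitions added per step); it is not stated in
the text, and the rows for steps 0 and 1 do not follow it.

```python
def automaton_size(i):
    """(final states, states, transitions) at step i >= 2, fitted to Table 1."""
    if i < 2:
        raise ValueError("the linear pattern only starts at step 2")
    return (3 * i, 6 * i + 1, 11 * i + 2)


# A few rows copied from Table 1: step -> (final states, states, transitions).
TABLE_SAMPLE = {
    2: (6, 13, 24),
    3: (9, 19, 35),
    10: (30, 61, 112),
    20: (60, 121, 222),
    39: (117, 235, 431),
}


def fits_table():
    """True if the fitted formula reproduces every sampled row."""
    return all(automaton_size(i) == row for i, row in TABLE_SAMPLE.items())
```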



What we can conclude is the following much weaker result, first noticed in [8]:

Theorem 2. If X is recursive, then C(X) is in co-RE, that is, its complement
is recursively enumerable.

Proof. As we noticed above, all approximations X_i are recursive, and moreover
effectively findable. Now the result follows from the identity

\overline{C(X)} = \bigcup_{i \ge 0} \overline{X_i},

where the bar is used for complementation. Indeed, a method to algorithmically
list all the elements of \overline{C(X)} is as follows: enumerate all words
w_1, w_2, w_3, ... and test for all i and j whether w_i ∈ \overline{X_j}.
Whenever a positive answer is obtained, output w_i.
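
The enumeration in this proof can be sketched as a dovetailing loop. This is
only an illustration under stated assumptions: `in_approximation(w, j)` stands
for a hypothetical decision procedure for membership in the j-th approximation,
and `toy_in_approximation` is a made-up decreasing chain used to exercise the
loop, not the construction of the paper.

```python
from itertools import count, islice, product


def words(alphabet):
    """Yield all non-empty words over `alphabet`, shortest first."""
    for length in count(1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def enumerate_complement(alphabet, in_approximation):
    """Semi-algorithm: yield each word w with w not in X_j for some j.

    `in_approximation(w, j)` is a hypothetical decision procedure for
    membership of w in the j-th approximation X_j.
    """
    seen = []          # words w_1, w_2, ... enumerated so far
    emitted = set()    # words already output
    source = words(alphabet)
    for stage in count(0):
        seen.append(next(source))
        # Dovetailing: at this stage test every pair (word index, j)
        # whose indices sum to `stage`, so each pair is reached eventually.
        for index, w in enumerate(seen):
            j = stage - index
            if w not in emitted and not in_approximation(w, j):
                emitted.add(w)
                yield w


def toy_in_approximation(w, j):
    """Toy decreasing chain: X_j = words starting with 'a' or longer than j."""
    return w.startswith("a") or len(w) > j
```

With the toy chain the intersection of all X_j is the set of words starting
with 'a', so `islice(enumerate_complement("ab", toy_in_approximation), 3)`
only ever produces words starting with 'b'. The loop never terminates on its
own, which is exactly why it is a semi-algorithm.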

For regular languages X we have the following result.

Theorem 3. Let X ⊆ A^+ be regular. If C(X) is regular, even noneffectively,
then C(X) is recursive.

Proof. We assume that X is regular, and effectively given, while C(X) is regular
but potentially nonconstructively. We have to show how to decide the membership
problem for C(X). This is obtained as a combination of two semialgorithms.

Fig. 5. The finite automaton recognizing



A semialgorithm for the question “x ∉ C(X)?” is obtained as in the proof
of Theorem 2. A semialgorithm for the complementary question is as follows:
Given x, enumerate all regular languages L_1, L_2, ... and test whether or not

(i) L_i X = X L_i

and

(ii) x ∈ L_i.

Whenever the answers to both questions are affirmative, output the input x.
The correctness of the procedure follows since C(X) is assumed to be regular
and every language commuting with X is contained in C(X).
Note also that the tests in (i) and (ii) can be done since the L_i's and X are
effectively known regular languages.



Theorem 3 is a bit amusing since it gives a meaningful example of a case
where regularity implies recursiveness. Note also that the weaker assumption
that C(X) is only context-free would not allow the proof, since the question
in (i) is undecidable for context-free languages, see [4].

We conclude with more practical comments. Although we have not been 
able to use our fixed point approach to answer Conway’s Problem even in some 
new special cases, we can use it for concrete examples. Indeed, as shown by 
experiments, in most cases the iteration terminates in a finite number of steps. 
Typically in these cases the centralizer is of one of the following forms: 

Fig. 6. The finite automaton recognizing Xg



(i) 

(ii) or 

(iii) {w ∈ A^+ | wX, Xw ⊆ XX^+}.

However, in some cases, as shown in Section 4, an infinite number of iterations
is needed, and consequently some ad hoc methods are required to compute
the centralizer.

The four-element set of Example 1 is an instance where the centralizer is not
of any of the forms (i)-(iii). This, as well as some other such examples, can be
verified by the fixed point approach using a computer, see [16].

Abstract. Biomolecular computing requires the use of carefully crafted
DNA words. We compare and integrate two known different theoretical
approaches ([2,3]) to the encoding problem for DNA languages: one is
based on a relativization of concepts related to comma free codes, the
other is a generalization to any given mapping θ of the notion of words
appearing as substrings of others. Our first results show which parts of
these two formalisms are in fact equivalent. The remaining remarks suggest
how to define, to the benefit of laboratory experiments, properties
of DNA encodings which merge relativization from [2] and the generality
of θ from [3].

1 Introduction 

A notion which has always been interesting for the problem of DNA encoding
is that of comma free codes ([1]): in comma free codes, no word will appear as
a substring across the concatenation of two words, i.e., somehow joining them
together. This means that when concatenating many words from a comma free
code, there will be no mistake in looking for borders between adjacent words,
and thus, for instance, no mistake in parsing the message by starting from the
middle of the string.
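
For a finite set of words, the property just described can be tested by brute
force. The following is a minimal sketch under our own reading of the informal
definition: a violation is an occurrence of a codeword in uv that straddles the
border between u and v. The function name is ours and all words are assumed to
be non-empty.

```python
def is_comma_free(code):
    """True if no codeword occurs in uv across the border between u and v."""
    for u in code:
        for v in code:
            uv = u + v
            border = len(u)
            for w in code:
                start = uv.find(w)
                while start != -1:
                    end = start + len(w)
                    # The occurrence starts inside u and ends inside v.
                    if start < border < end:
                        return False
                    start = uv.find(w, start + 1)
    return True
```

For example, {ab, ba} fails the test because ba appears across the border of
ab·ab, whereas {adb, bda} passes it.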

This property fits nicely with what we want in a test tube. If we perform a 
detection experiment with short probes on a long single stranded DNA molecule, 
each probe could hybridize independently to any corresponding pattern, each 
acting as a small parser working inside the longer string. 

Head, in [2], described a formalism in which it was proved that even codes
which are not comma free can be used for biocomputing experiments. In fact,
they can be split into sub-codes which are comma free, and whose corresponding
sets of DNA probes can be put separately in the test tube, so as to avoid
unwanted hybridizations.

But one difference between strings and DNA molecules is that DNA probes
could also hybridize among themselves, or the longer molecule could fold and
self-hybridize. For this reason, theoretical models of biomolecular computation
and DNA encoding try to consider how to avoid Watson-Crick reverse
complementarity, which is required by self-hybridization.

In particular, Hussini et al. [3] define a formalism which further generalizes
these considerations. They study languages that avoid, in strings resulting from
the concatenation of codewords, the undesired appearance of substrings which
are images of other codewords under some generic morphism, or antimorphism, θ.

In this paper we will formally present some relationships between the above
mentioned two formalisms. In Section 3 we show that the first one is a
relativising generalization of the second one, for θ being the identity mapping.
On the other hand, in Section 4 we also suggest a way to combine the
relativization of [2] with the generality of θ of [3].

As background references, we can point for some theory of codes to [1], and for
a presentation of DNA computing to [7]. Further results related to the approach
of [3] are presented in [6] and [5], and software implementing some ideas from
this formalism is discussed in [4].

2 Definitions and Known Results

For a finite alphabet X, X* is the free monoid generated by X. Elements of X* 
will be called indifferently strings or words. 

Any subset of X* is a language. In particular, a subset C of X* is a code
if it is uniquely decipherable, that is, for each word w ∈ C*, there is only one
non-negative integer n and one finite sequence c_1, c_2, ..., c_{n-1}, c_n of
words in C for which w = c_1 c_2 ... c_{n-1} c_n. Please note that the choice of
considering a code as being always uniquely decipherable is the same as in [3],
while in [2] any language was also considered a code.

A string w in X* is a factor of a language L if there are strings x, y ∈ X* for
which xwy ∈ L.

Among the definitions from [2], we are interested in those dealing with lan- 
guages which avoid some ways of having common patterns in pairs of words. 

Definition 1. A language C is called solid iff the following two properties are
satisfied:

1. u and puq in C can hold only if pq is null,

2. pu and uq in C and u non-null can hold only if pq is null.

Moreover, if a language satisfies property 1 above, then it is called solid/1.

Property 1 forbids having a word inside another word, while the second required
property forbids having the suffix of a word overlapping the prefix of another
word.
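
For finite languages both properties of Definition 1 can be checked directly.
The sketch below is our own illustration, with function names of our choosing;
it assumes that all words are non-empty Python strings.

```python
def is_solid1(C):
    """Property 1: no word of C occurs as a proper factor of another word of C."""
    return not any(u != v and u in v for u in C for v in C)


def has_proper_overlap(x, y):
    """True if x = pu and y = uq with u non-empty and pq non-empty."""
    for k in range(1, min(len(x), len(y)) + 1):
        # u is the common part: a suffix of x and a prefix of y of length k.
        if x[-k:] == y[:k] and (k < len(x) or k < len(y)):
            return True
    return False


def is_solid(C):
    """Both properties of Definition 1, for a finite set of non-empty words."""
    return is_solid1(C) and not any(
        has_proper_overlap(x, y) for x in C for y in C
    )
```

As a usage example, {adb, bda} is solid/1 but not solid, because the suffix b of
adb overlaps the prefix b of bda; the single word aba is not solid either, since
it overlaps itself.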

The following definition introduces the notion of relativization, where the 
same properties above are verified only on words appearing in strings of a (po- 
tentially) different language. 

Definition 2. A language C is called solid relative to a language L iff the 
following two properties are satisfied: 

1. w = ypuqz in L with u and puq in C can hold only if pq is null, 

2. w = ypuqz in L with pu and uq in C and u non-null can hold only if pq is 
null. 
Moreover, if a language satisfies property 1 above, then it is called solid/1
relative to L.

One last definition from [2] considers another relativised property:

Definition 3. A string w in A* is a join relative to a language L iff w is a
factor of L and for every u, v in A* for which uwv in L, both u and v are also
in L. A word w in a language L is a join in L iff w is a join relative to L*.
J(L) will denote the set of all joins in the language L.

In [2] it is proved that: 

- if a language on alphabet A is solid, then it is solid relative to every
  language in A*,

- if a language is solid relative to a language L, then it is solid relative to
  every subset of L,

- if C is a code, then J(C) is solid relative to C* (Proposition 4 in [2]).

The definitions from [3] deal with involutions. An involution θ : A → A is a
mapping such that θ^2 equals the identity mapping I: I(x) = θ(θ(x)) = x for
all x ∈ A. The following definitions, presented here with notations made uniform
with those of [2], describe languages which avoid having a word mapped to a
substring of another word, or to a substring of the concatenation of two words.

Definition 4. If θ : A* → A* is an involution, then a language L ⊆ A* is said
to be θ-compliant iff u and xθ(u)y in L can hold only if xy is null.
Moreover, if L ∩ θ(L) = ∅, then L is called strictly θ-compliant.

In [5] it is proved that a language is strictly θ-compliant iff u and xθ(u)y
in L never holds.

Definition 5. If θ : A* → A* is an involution, then a language L ⊆ A* is said
to be θ-free iff u in L with xθ(u)y in L^2 can hold only if x or y is null.
Moreover, if L ∩ θ(L) = ∅, then L is called strictly θ-free.

In [3] it is proved that a language is strictly θ-free iff u in L with xθ(u)y
in L^2 never holds.

In [3] it is also proved that if a language L is θ-free, then both L and θ(L)
are θ-compliant.
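
Definitions 4 and 5 can be tested by exhaustive search when the language is
finite. The following sketch is ours: `theta` is any Python function on strings
supplied by the caller and assumed to be an involution, and the function names
are hypothetical. Words are assumed to be non-empty.

```python
from itertools import product


def is_theta_compliant(L, theta):
    """u and x theta(u) y in L can hold only if xy is empty."""
    images = {theta(u) for u in L}
    return not any(img in w and img != w for img in images for w in L)


def is_theta_free(L, theta):
    """u in L with x theta(u) y in L^2 can hold only if x or y is empty."""
    images = {theta(u) for u in L}
    for s, t in product(L, repeat=2):
        w = s + t
        for img in images:
            pos = w.find(img, 1)                # x non-empty: start at 1 or later
            while pos != -1:
                if pos + len(img) < len(w):     # y non-empty as well
                    return False
                pos = w.find(img, pos + 1)
    return True


def is_strict(L, theta):
    """The extra condition for the strict variants: L and theta(L) are disjoint."""
    return not ({theta(u) for u in L} & set(L))


def identity(word):
    """The identity mapping I, used for the examples."""
    return word
```

With `theta = identity`, the language {ab, ba} is I-compliant but not I-free
(ba occurs strictly inside ab·ab), and no non-empty language is strict, in line
with the remark on L ∩ I(L) in Section 3.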



3 Relationships Between the Two Formalisms 

If we restrict ourselves to the case of θ = I, with I being the identity mapping,
then we can discuss the similarities between the properties of a DNA code as
defined in [2] and those defined in [3].

First of all we observe that I-compliance and I-freedom cannot be strict,
since L ∩ I(L) = L (unless we are in the case of L = ∅).

Moreover, I-compliance is equivalent to enjoying the first property required
by solidity, which we called solidity/1. Therefore, if a language L is solid,
or L is solid relative to a language in which all strings in L are factors,
then L is also I-compliant.

Finally, it is obvious that if L is solid/1 then it is solid/1 relative to any
language.

The following simple lemma considers another fact about solidity/1 properties,
and will also be used later. It applies, for instance, to the case of M = L,
or M = L^2, or M = L*.

Lemma 1. A language L is solid/1 iff L is solid/1 relative to a language M in
which all strings in L are factors.

Proof. Solidity/1 is obviously stronger than any relative solidity/1; on the
other hand, solidity/1 relative to M verifies that no word of L is a proper
substring of any other.



Proposition 1. A language L is solid relative to L* iff L is solid relative
to L^2.

Proof. It is immediate to verify that solidity relative to L* implies solidity
relative to L^2, since L^2 is a subset of L*. We prove the other direction by
contradiction: if L is not solid relative to L* then either L falsifies
condition 1, and the same would hold relative to L^2, or a word in L^2 falsifies
relative solidity, and this itself would be the contradiction, or a word
w = ypuqz ∈ L^n, with pq non null and n > 1, would contradict solidity. This
last case would falsify solidity relative to L^n in one of the ways that we
will see, and so lead to a contradiction for L^2. If w has a parsing in L^n
according to which either yp has a prefix or qz has a suffix in L, then we
could build from w, by dropping such prefix or suffix, a word which would
falsify solidity in L^n. If no such parsing exists, then pu or uq falsify
condition 1 of relative solidity, by spanning at least one word of L in the
string w.



Proposition 2. A language L is I-free iff L is solid relative to L*.

Proof. Proposition 1 allows us to reduce the statement to considering only the
equivalence with solidity relative to L^2. We know that I-freedom implies
I-compliance, which in turn implies solidity/1 and, by Lemma 1, solidity/1
relative to M = L^2. Arguing by contradiction, if L were I-free but did not
enjoy the second property required by relative solidity, then we would have a
string w ∈ L^2 such that w = ypuqz with u non null and pu, uq ∈ L. This would
contradict solidity/1 relative to L^2, in case yz, or p or q, would be null, or
I-freedom in other cases; for instance, if y and q are non null, the word
pu ∈ L would falsify I-freedom.

In the other direction, solidity relative to L^2 implies I-freedom, otherwise
we would have a word st = yuz ∈ L^2 such that u ∈ L and both y and z
are non null; u would contradict property 1 of relative solidity, if it is a
substring of s or t, or property 2 otherwise.

We could say that these links between the two formalisms come as no surprise,
since both [2] and [3] chose their definitions in order to model the same
reactions between DNA molecules, which have to be avoided.

To complete the picture, we could define the following language families:

- SOL, which contains all languages which are solid,

- IF, which contains all languages which are I-free,

- RS_n and RS_*, which contain all languages L which are solid relative to L^n
  and all languages L which are solid relative to L*, respectively.

We can show the existence of a language E_1 which is in RS_2 but not in SOL,
and of a language E_2 which is in RS_1 but not in RS_2: E_1 = {adb, bda},
E_2 = {ab, c, ba} (E_2 was suggested for a similar purpose in [2]).
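
The two example languages can be checked by brute force against Definition 2.
The sketch below is ours and works only for finite languages of non-empty
words; `power` and `is_solid_relative` are hypothetical helper names, and
failure of plain solidity is shown through a single witness language rather
than through all languages.

```python
from itertools import product


def power(L, n):
    """The finite language L^n: all concatenations of n words of L."""
    return {"".join(ws) for ws in product(sorted(L), repeat=n)}


def is_factor_of(f, M):
    """True if the string f is a factor of some word of the finite language M."""
    return any(f in w for w in M)


def is_solid_relative(C, M):
    """Definition 2 for finite C and M (all words assumed non-empty)."""
    # Property 1: u and puq in C with pq non-empty, and puq a factor of M.
    for u in C:
        for v in C:
            if u != v and u in v and is_factor_of(v, M):
                return False
    # Property 2: x = pu, y = uq in C, u non-empty, pq non-empty,
    # and puq a factor of M.
    for x in C:
        for y in C:
            for k in range(1, min(len(x), len(y)) + 1):
                proper = k < len(x) or k < len(y)
                if x[-k:] == y[:k] and proper and is_factor_of(x + y[k:], M):
                    return False
    return True


E1 = {"adb", "bda"}
E2 = {"ab", "c", "ba"}
```

Here `is_solid_relative(E1, power(E1, 2))` holds, while the witness language
{adbda} shows that E_1 is not solid; E_2 is solid relative to itself but not
relative to `power(E2, 2)`, which contains abab and hence the factor aba.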

The above results allow us to state that, with n > 2:

SOL ⊂ RS_2 = RS_n = RS_* = IF ⊂ RS_1.

We observe that this can be considered as an alternative way of proving the
correctness of the two choices made in [2] and [3], which by definition place
the family of comma free codes in RS_* and IF, respectively. It appears that
both formalisms extend the notion of comma free codes: the first toward
languages which are not comma free, but which can contain comma free subsets
J(C), and the second toward the more general θ-freedom property.



4 Further Relativization 

In [2], relativization made it possible to show how to scale an experiment on
DNA molecules to a code C which is not comma free but can be split into a
number of sub-codes J(C_k) which are comma free. This concept can be
illustrated by this informal notation: C = ∪_k J(C_k), where C_0 = C,
C_{i+1} = C_i \ J(C_i). In fact, we could use large codes, even when they are
not comma free, by performing the hybridization step of a DNA experiment as a
short series of sub-steps. In each sub-step we mix in the test tube, for
instance, a long single stranded DNA molecule with short probes which are words
of a single C_k.

Here we want to move from comma free codes, i.e., solid w.r.t. Kleene closure
or, equivalently, I-free, to a definition similar to that of J(C) but relative
to θ-freedom, where the involution θ is different from I. For instance, we
could be interested in the antimorphic involution τ on the four letter alphabet
A = {A, G, C, T} representing DNA nucleotides; τ will be defined so as to
model the reverse complementarity between DNA words.

We first need to define the complement involution γ : A* → A* as mapping
each A to T, T to A, G to C, and C to G. The mirror involution μ : A* → A*
maps a word u to a word μ(u) = v defined as follows:

u = a_1 a_2 ... a_k, v = a_k ... a_2 a_1, a_i ∈ A, 1 ≤ i ≤ k.

Now we can define the involution τ = μγ = γμ on A*, modeling the Watson-Crick
mapping between complementary single-stranded DNA molecules.
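
The three mappings are easy to write down for Python strings over the letters
A, C, G, T. This is only an illustrative sketch; the function names `gamma`,
`mu` and `tau` simply mirror the symbols used in the text.

```python
# Letter-wise Watson-Crick complement: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gamma(word):
    """Complement involution: replace each nucleotide by its complement."""
    return word.translate(COMPLEMENT)


def mu(word):
    """Mirror involution: reverse the word."""
    return word[::-1]


def tau(word):
    """Reverse complement, tau = mu . gamma = gamma . mu."""
    return mu(gamma(word))
```

For example, `tau("AACG")` is `"CGTT"`. Applying `tau` twice gives back the
original word, and `tau(u + v)` equals `tau(v) + tau(u)`, which is the
antimorphic behaviour mentioned above.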

Results from Section 3, and the fact that J(C) ⊆ C, allow us to restate
Proposition 4 from [2] (quoted in our Section 2) as follows: if C is a code,
then J(C) is I-free. In a similar way, we want a definition of J_τ(C), subset
of C, so that J_τ(C) is τ-free.

The experimental motivation of this goal is that τ-freedom, and not I-freedom
or relative solidity, would protect from the effects of unwanted reverse
complementarity. I-freedom guarantees that if we put a set of probes from J(C)
in the test tube, they will correctly hybridize to the hypothetical longer DNA
molecule, even if this is built from the larger set C. Nonetheless, among words
in J(C) there could still be words which can hybridize with words from the
same set J(C), interfering with the desired reactions. This would be formally
equivalent to saying that J(C) is not τ-free.

We conjecture that we could define J_τ(C) as being related to J(C), as follows:

J_τ(C) = J(C) \ τ(J(C)).

Such a subset of C would be I-free, since it is a subset of J(C), and would
avoid self hybridization among words of J_τ(C). Further, it would induce a
splitting in each J(C), which would still allow detection, i.e., matching
between probes and segments on a longer molecule, to be operated in a
relativised, step by step way, even if with a greater number of steps. Further
theoretical study of this issue is under way.



5 Final Remarks 

In order to make these theoretical results actually useful for laboratory DNA 
experiments, more work should be done on example codes. 

It would be interesting to generate codes which enjoy the properties defined
in Section 4, and to compare them to codes generated as in Proposition 16 of [3]
or as in Proposition 4 of [5].

On the other hand, algorithms could be studied that would decide whether
a given code enjoys properties defined in Section 4, and whether there is a way
of splitting it into a finite number of J_τ(C_k) sub-codes.

Abstract. We consider a variant of test tube systems communicating
by applying splicing rules, yet without allowing operations in the tubes 
themselves. These test tube systems communicating by applying splic- 
ing rules of some restricted type are shown to be equivalent to a variant 
of splicing test tube systems using a corresponding restricted type of 
splicing rules. Both variants of test tube systems using splicing rules 
are proved to be equivalent to membrane systems with splicing rules as- 
signed to membranes, too. In all the systems considered in this paper, 
for the application of rules leading from one configuration to the suc- 
ceeding configuration we use a sequential model, where only one splicing 
rule is applied in one step; moreover, all these systems have universal 
computational power. 



1 Introduction 

Already in 1987, (a specific variant of) the splicing operation was introduced by 
Tom Head (see [10]). Since then, (formal variants of) splicing rules have been 
used in many models in the area of DNA computing, which rapidly evolved after 
Adleman in [1] had described how to solve an instance of the Hamiltonian path 
problem (an NP-complete problem) in a laboratory using DNA. The universality 
of various models of splicing systems (then also called H systems) was shown in 
[8] and [14]. 

Distributed models using splicing rules theoretically simulating possible lab
implementations were called test tube systems. The universality of test tube
systems with splicing rules was first proved in [2]; optimal results with
respect to the number of tubes, showing that two tubes are enough, were proved
in [7] (Pixton’s results in [16] show that only regular sets can be generated
using only one tube). A first overview of the area of DNA computing was given
in [15]; another comprehensive overview can be found in [17].

In this paper we introduce a variant of test tube systems communicating by 
applying splicing rules to end-marked strings, yet without allowing operations in 
the tubes themselves. At each computation step, i.e., at each communication step
from tube i to tube j, only one splicing rule is applied using one axiom (which 
we assume to be provided by the communication rule) and one other string 
from tube i (which is no axiom); the result of the application of the splicing 
rule then is moved to tube j. We suppose that this variant of test tube systems 
communicating by splicing may be interesting for lab applications, too. For all 
other systems considered in this paper, we shall also use this special variant of 
splicing rules where exactly one axiom is involved. 

In 1998, Gheorghe Paun introduced membrane systems (see [11]), and since 
then the field of membrane systems (soon called P systems) has been growing 
rapidly. In the original model, the rules responsible for the evolution of the 
systems were placed inside the region surrounded by a membrane and had to be 
applied in a maximally parallel way. In most variants of P systems considered so 
far, the membrane structure consisted of membranes hierarchically embedded in 
the outermost skin membrane, and every membrane enclosed a region possibly 
containing other membranes and a finite set of evolution rules. Moreover, in the 
regions multisets of objects evolved according to the evolution rules assigned 
to the regions, and by applying these evolution rules in a non-deterministic, 
maximally parallel way, the system passed from one configuration to another 
one, in that way performing a computation. 

Sequential variants of membrane systems were introduced in [5]. Various
models of membrane systems were investigated in further papers, e.g., see [3],
[11], [12]; for a comprehensive overview see [13]; recent results and
developments in the area can be looked up on the web at [19].

A combination of both ideas, i.e., using splicing rules in P systems, was 
already considered from the beginning in the area of membrane systems, e.g., see 
[3]; for other variants, e.g., splicing P systems with immediate communication, 
and results the reader is referred to [13]. Non-extended splicing P systems and 
splicing P systems with immediate communication with only two membranes 
can already generate any recursively enumerable language as is shown in [18]. 
In [6], sequential P systems with only two membranes as well as splicing rules 
and conditions for the objects to move between the two regions (in some sense 
corresponding to the filter conditions of test tube systems) were shown to be 
universal. In [9] the model of P systems with splicing rules assigned to membranes 
was introduced; using strings from inside or outside the membrane, a splicing 
rule assigned to the membrane is applied and the resulting string is sent inside 
or outside the membrane. As the main results in this paper we show that these 
P systems with splicing rules assigned to membranes are equivalent to splicing 
test tube systems as well as to test tube systems communicating by splicing, 
which are introduced in this paper. 

In the following section we first give some preliminary definitions and recall 
some definitions for splicing systems and splicing test tube systems; moreover, 
we introduce our new variant of test tube systems communicating by applying 
splicing rules (not using splicing rules in the test tubes themselves); finally, we 
recall the definition of P systems with splicing rules assigned to membranes. In 
the third section we show that both variants of test tube systems considered in
this paper are equivalent and therefore have the same computational power. In 
the fourth section we show that these variants of test tube systems using splicing 
rules and P systems with splicing rules assigned to membranes are equivalent, 
too. An outlook to related results and a short summary conclude the paper. 

2 Definitions 

In this section we define some notions from formal language theory and recall 
the definitions of splicing schemes (H-schemes, e.g., see [2], [14]) and splicing test 
tube systems (HTTS, e.g., see [2], [7]). Moreover, we define the new variant of 
test tube systems with communication by splicing (HTTCS). Finally, we recall 
the definition of membrane systems with splicing rules assigned to membranes 
as introduced in [9]. 



2.1 Preliminaries 

An alphabet V is a finite non-empty set of abstract symbols. Given V, the free
monoid generated by V under the operation of concatenation is denoted by V*;
the empty string is denoted by λ, and V* \ {λ} is denoted by V^+. By |x| we
denote the length of the word x over V. For more notions from the theory of
formal languages, the reader is referred to [4] and [17].



2.2 Splicing Schemes and Splicing Systems 

A molecular scheme is a pair σ = (B, P), where B is a set of objects and P is a
set of productions. A production p in P in general is a partial recursive
relation ⊆ B^k × B^m for some k, m ≥ 1, where we also demand that for all
w ∈ B^k the range p(w) is finite, and moreover, there exists a recursive
procedure listing all v ∈ B^m with (w, v) ∈ p. For any two sets L and L' over
B, we say that L' is computable from L by a production p if and only if for
some (w_1, ..., w_k) ∈ B^k and (v_1, ..., v_m) ∈ B^m with
(w_1, ..., w_k, v_1, ..., v_m) ∈ p we have {w_1, ..., w_k} ⊆ L and
L' = L ∪ {v_1, ..., v_m}; we also write L ⟹_p L' and L ⟹_σ L'. A computation
in σ is a sequence L_0, ..., L_n such that L_i ⊆ B, 0 ≤ i ≤ n, n ≥ 0, as well
as L_i ⟹_σ L_{i+1}, 0 ≤ i < n; in this case we also write L_0 ⟹_σ^n L_n;
moreover, we write L_0 ⟹_σ^* L_n if L_0 ⟹_σ^n L_n for some n ≥ 0. A molecular
system is a triple σ = (B, P, A) where (B, P) is a molecular scheme and A is a
set of axioms from B. An extended molecular system is a quadruple
σ = (B, B_T, P, A), where (B, P, A) is a molecular system and B_T is a set of
terminal objects with B_T ⊆ B. The language generated by σ is

L(σ) = {w | A ⟹_σ^* L, w ∈ L ∩ B_T}.

Throughout this paper we shall consider end-marked strings as objects in a
molecular scheme or system as well as in membrane systems. End-marked strings
are objects of the form mwn, where m, n ∈ M, M a set of markers, and w ∈ W*
for an alphabet W with W ∩ M = ∅.

A splicing scheme (over end-marked strings) is a pair σ, σ = (MW*M, R),
where M is a set of markers, W is an alphabet with W ∩ M = ∅, and

R ⊆ (M ∪ {λ}) W*#W* (M ∪ {λ}) $ (M ∪ {λ}) W*#W* (M ∪ {λ});

#, $ are special symbols not in M ∪ W; R is the set of splicing rules. For
x, y, z ∈ MW*M and a splicing rule r = u_1#u_2$u_3#u_4 in R we define
(x, y) ⟹_r z if and only if x = x_1 u_1 u_2 x_2, y = y_1 u_3 u_4 y_2, and
z = x_1 u_1 u_4 y_2 for some x_1, y_1 ∈ MW* ∪ {λ}, x_2, y_2 ∈ W*M ∪ {λ}.
By this definition, we obtain the derivation relation ⟹_σ for the splicing
scheme σ in the sense of a molecular scheme as defined above. An extended
H-system (or extended splicing system) γ is an extended molecular system of
the form γ = (MW*M, M_T W_T* M_T, R, A), where M_T ⊆ M is the set of terminal
markers, W_T ⊆ W is the set of terminal symbols, and A is the set of axioms.
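
The effect of a single splicing rule on two end-marked strings can be sketched
as follows. This is our own illustration: rules are written as plain strings
of the form u1#u2$u3#u4, markers and symbols are assumed to be single
characters, and the positional conditions on x_1, y_1, x_2, y_2 are not checked
separately because, with disjoint marker and symbol sets, markers can only
match at the ends of well-formed end-marked strings.

```python
def parse_rule(rule):
    """Split a rule written as 'u1#u2$u3#u4' into its four parts."""
    left, right = rule.split("$")
    u1, u2 = left.split("#")
    u3, u4 = right.split("#")
    return u1, u2, u3, u4


def splice(rule, x, y):
    """Return the set of all z with (x, y) =>_r z for the rule r."""
    u1, u2, u3, u4 = parse_rule(rule)
    site_x, site_y = u1 + u2, u3 + u4
    results = set()
    i = x.find(site_x)
    while i != -1:
        prefix = x[: i + len(u1)]            # x1 u1
        j = y.find(site_y)
        while j != -1:
            suffix = y[j + len(u3):]         # u4 y2
            results.add(prefix + suffix)
            j = y.find(site_y, j + 1)
        i = x.find(site_x, i + 1)
    return results
```

As a usage example with made-up strings, `splice("Aa#Z$A#a", "AaZ", "AabB")`
cuts AaZ between Aa and Z, cuts AabB between A and a, and joins the first part
of the former with the second part of the latter, giving AaabB.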

2.3 Splicing Test Tube Systems 

A splicing test tube system (HTTS for short) with n test tubes is a construct σ,

σ = (MW*M, M_T W_T* M_T, A_1, ..., A_n, I_1, ..., I_n, R_1, ..., R_n, D)

where

1. M is a set of markers, W is an alphabet with W ∩ M = ∅;

2. M_T ⊆ M is the set of terminal markers and W_T ⊆ W is the set of terminal
   symbols;

3. A_1, ..., A_n are the sets of axioms assigned to the test tubes, where
   A_i ⊆ MW*M, 1 ≤ i ≤ n; moreover, we define A := ∪_{i=1}^n A_i;

4. I_1, ..., I_n are the sets of initial objects assigned to the test tubes
   1, ..., n, where I_i ⊆ MW*M, 1 ≤ i ≤ n; moreover, we define
   I := ∪_{i=1}^n I_i and require A ∩ I = ∅;

5. R_1, ..., R_n are the sets of splicing rules over MW*M assigned to the test
   tubes 1, ..., n; moreover, we define R := ∪_{i=1}^n R_i; every splicing rule
   in R_i has to contain exactly one axiom from A_i (which, for better
   readability, will be underlined in the following) as well as to involve
   another end-marked string from MW*M \ A;

6. D is a (finite) set of communication relations between the test tubes in σ
   of the form (i, F, j), where 1 ≤ i ≤ n, 1 ≤ j ≤ n, and F is a filter of the
   form {A} W* {B} with A, B ∈ M (for any i, j, there may be any finite number
   of such communication relations).

In the interpretation used in this section, a computation step in the system
σ runs as follows: In one of the n test tubes, a splicing rule from R_i is
applied to an object in the test tube (which is not an axiom in A) together
with an axiom from A_i. If the resulting object can pass a filter F for some
(i, F, j) ∈ D, then this object may either move from test tube i to test tube j
or else remain in test tube i, otherwise it has to remain in test tube i. The
final result of the computations in σ consists of all terminal objects from
M_T W_T* M_T that can be extracted from any of the n tubes of σ.

We should like to emphasize that for a specific computation we assume all 
axioms and all initial objects not to be available in an infinite number of copies, 
but only in a number of copies sufficiently large for obtaining the desired result. 



2.4 Test Tube Systems Communicating by Splicing 

A test tube system communicating by splicing (HTTCS for short) with $n$ test
tubes is a construct $\sigma$,

$\sigma = (MW^*M,\ M_T W_T^* M_T,\ A,\ I_1, \ldots, I_n,\ C)$



where 

1. $M$ is a set of markers, $W$ is an alphabet with $W \cap M = \emptyset$;

2. $M_T \subseteq M$ is the set of terminal markers and $W_T \subseteq W$ is the set of
terminal symbols;

3. $A$ is a (finite) set of axioms, $A \subseteq MW^*M$;

4. $I_1, \ldots, I_n$ are the sets of initial objects assigned to the test tubes
$1, \ldots, n$, where $I_i \subseteq MW^*M$, $1 \le i \le n$; moreover, we define
$I := \bigcup_{i=1}^{n} I_i$ and require $A \cap I = \emptyset$;

5. $C$ is a (finite) set of communication rules of the form $(i, r, j)$, where
$1 \le i \le n$, $1 \le j \le n$, and $r$ is a splicing rule over $MW^*M$ which has to
contain exactly one axiom from $A$ as well as to involve another end-marked
string from $MW^*M \setminus A$; moreover, we define
$R := \bigcup_{(i,r,j) \in C} \{r\}$.



In the model defined above, no rules are applied in the test tubes themselves;
an end-marked string in test tube $i$ is only affected when being communicated
to another tube $j$ if the splicing rule $r$ in a communication rule $(i,r,j)$ can be
applied to it; the result of the application of the splicing rule $r$ to the
end-marked string together with the axiom from $A$ which occurs in $r$ is
communicated to tube $j$. The final result of the computations in $\sigma$ consists of
all terminal objects from $M_T W_T^* M_T$ that can be extracted from any of the $n$
tubes of $\sigma$.

We should like to point out that again we avoid (target) conflicts by assuming
all initial objects to be available in a sufficiently large number of copies,
and, moreover, we then assume only one copy of an object to be affected by a
communication rule, whereas the other copies remain in the original test tube.
Obviously, we assume the axioms used in the communication rules to be available
in a sufficiently large number of copies, too.

Example 1. The HTTCS (see Figure 1)

$(\{A,B,Z\}\{a,b\}^*\{A,B,Z\},\ \{A,B\}\{a,b\}^*\{A,B\},\ \{AaZ, ZbB\},\ \{AabB\},\ \emptyset,\ C)$

with the communication rules

1. $(1,\ \underline{Aa\#Z}\,\$\,A\#a,\ 2)$ (using the axiom $AaZ$) and

2. $(2,\ b\#B\,\$\,\underline{Z\#bB},\ 1)$ (using the axiom $ZbB$)

generates the linear (but not regular) language

$\{A a^{n+1} b^{n} B \mid n \ge 1\}.$





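[Figure 1: test tube 1, initially containing AabB, and test tube 2; the
communication rule Aa#Z$A#a leads from tube 1 to tube 2, and the communication
rule b#B$Z#bB leads from tube 2 to tube 1.]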



Fig. 1. Example of an HTTCS. 
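
The behaviour of Example 1 can be traced with a small simulation. The following is only a sketch under stated assumptions: it assumes the usual splicing semantics (a rule $u_1\#v_1\$u_2\#v_2$ turns $x_1u_1v_1y_1$ and $x_2u_2v_2y_2$ into $x_1u_1v_2y_2$), it runs a bounded number of rounds, and the names `splice`, `RULES` and `simulate` are illustrative and not taken from the paper.

```python
def splice(rule, x, y):
    """Apply the splicing rule u1#v1$u2#v2 to the ordered pair (x, y).

    Every way of writing x = x1 u1 v1 y1 and y = x2 u2 v2 y2
    contributes the result x1 u1 v2 y2.
    """
    left, right = rule.split("$")
    u1, v1 = left.split("#")
    u2, v2 = right.split("#")
    results = set()
    for i in range(len(x) + 1):
        if x[:i].endswith(u1) and x[i:].startswith(v1):
            for j in range(len(y) + 1):
                if y[:j].endswith(u2) and y[j:].startswith(v2):
                    results.add(x[:i] + y[j:])
    return results


# (source tube, rule, axiom, is the axiom the first string?, target tube)
RULES = [
    (1, "Aa#Z$A#a", "AaZ", True, 2),
    (2, "b#B$Z#bB", "ZbB", False, 1),
]


def simulate(rounds):
    """Run a bounded number of rounds; copies stay in the source tube."""
    tubes = {1: {"AabB"}, 2: set()}
    for _ in range(rounds):
        arriving = {1: set(), 2: set()}
        for src, rule, axiom, axiom_first, dst in RULES:
            for word in tubes[src]:
                if axiom_first:
                    arriving[dst] |= splice(rule, axiom, word)
                else:
                    arriving[dst] |= splice(rule, word, axiom)
        for tube in tubes:
            tubes[tube] |= arriving[tube]
    return tubes


if __name__ == "__main__":
    for tube, words in simulate(4).items():
        print(tube, sorted(words, key=len))
```

In this sketch the strings collected in tube 2 have the form $A a^{n+1} b^{n} B$, while tube 1 holds strings of the form $A a^{n} b^{n} B$ on their way to the next round.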



2.5 P Systems with Splicing Rules Assigned to Membranes 

A P system with splicing rules assigned to membranes (PSSRAM for short) is a
construct $\Pi$,

$\Pi = (MW^*M,\ M_T W_T^* M_T,\ \mu,\ A_0, \ldots, A_n,\ I_0, \ldots, I_n,\ R_1, \ldots, R_n)$



where 

1. $M$ is a set of markers, $W$ is an alphabet with $W \cap M = \emptyset$;

2. $M_T \subseteq M$ is the set of terminal markers and $W_T \subseteq W$ is the set of
terminal symbols;

3. $\mu$ is a membrane structure (with the membranes labelled by non-negative
integers $0, \ldots, n$ in a one-to-one manner);

4. $A_0, \ldots, A_n$ are the sets of axioms assigned to the membranes $0, \ldots, n$,
where $A_i \subseteq MW^*M$, $0 \le i \le n$; moreover, we define
$A := \bigcup_{i=0}^{n} A_i$;

5. $I_0, \ldots, I_n$ are the sets of initial objects assigned to the membranes
$0, \ldots, n$, where $I_i \subseteq MW^*M$, $0 \le i \le n$; moreover, we define
$I := \bigcup_{i=0}^{n} I_i$ and require $A \cap I = \emptyset$;

6. $R_1, \ldots, R_n$ are the sets of rules assigned to the membranes $1, \ldots, n$
which are of the form

$(or_1, or_2; r; tar)$

for a splicing rule $r$ over $MW^*M$; every splicing rule occurring in a rule in
$R_i$ has to contain exactly one axiom from $A$ as well as to involve another
end-marked string from $MW^*M \setminus A$; $or_1, or_2 \in \{in, out\}$ indicate the
origins of the objects involved in the application of the splicing rule $r$,
whereas $tar \in \{in, out\}$ indicates the target region the resulting object has
to be sent to, where $in$ points to the region inside the membrane and $out$
points to the region outside the membrane. Observe that we do not assign rules
to the skin membrane labelled by 0.



A computation in $\Pi$ starts with the initial configuration with the axioms from
$A_k$ as well as the initial objects from $I_k$, $1 \le k \le n$, being placed in
region $k$. We assume all objects occurring in $A_k \cup I_k$, $1 \le k \le n$, to be
available in an arbitrary (unbounded) number. A transition from one
configuration to another one is performed by applying a rule from $R_k$,
$1 \le k \le n$. The language generated by $\Pi$ is the set of all terminal objects
$w \in M_T W_T^* M_T$ obtained in any of the membrane regions by some computation
in $\Pi$.

We should like to emphasize that we do not demand the axioms or initial
objects really to appear in an unbounded number in any computation of $\Pi$.
Instead we apply the more relaxed strategy to start with a limited but large
enough number of copies of these objects such that a desired terminal object
can be computed if it is computable by $\Pi$ when applying the rules sequentially
in a multiset sense; we do not demand things to evolve in a maximally parallel
way. In that way we avoid target conflicts, i.e., the other copies of the strings
involved in the application of a rule (being consumed or generated) just remain
in their regions. This working mode of a P system with splicing rules assigned
to membranes also reflects the sequential way of computations considered in the
models of test tube systems defined above.



3 Splicing Test Tube Systems and Test Tube Systems 
Communicating by Splicing Are Equivalent 

In this section we show that the splicing test tube systems and the test tube 
systems communicating by splicing newly introduced in this paper are equivalent 
models for generating end-marked strings. 

Theorem 1. For every HTTS we can construct an equivalent HTTCS. 

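Proof. Let

$\sigma = (MW^*M,\ M_T W_T^* M_T,\ A_1, \ldots, A_n,\ I_1, \ldots, I_n,\ R_1, \ldots, R_n,\ D)$

be an HTTS.

[Figure 2: test tube $i$ (with $R_i$, $A_i$, $I_i$) and test tube $j$ (with $R_j$,
$A_j$, $I_j$), connected by a filter of the form $\{A\}W^*\{B\}$.]

Fig. 2. Elements of HTTS $\sigma$.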



Then we construct an equivalent HTTCS

$\sigma' = (M'W^*M',\ M_T W_T^* M_T,\ A,\ I_1, \ldots, I_n,\ I'_1, \ldots, I'_l,\ C)$

where

1. $M' = M \cup \{Z\}$;

2. $A = \bigcup_{i=1}^{n} A_i \cup \{XZ, ZX \mid X \in M\}$;

3. for every filter $f(i,j,m)$, $1 \le m \le n_{ij}$, $n_{ij} > 0$, of the form
$\{A\}W^*\{B\}$ between the tubes $i$ and $j$ we introduce a new test tube $(i,j,m)$
in $\sigma'$; these new test tubes altogether are the $l$ additional test tubes we
need in $\sigma'$;

4. $I'_k = \emptyset$, $1 \le k \le l$;

5. the communication rules in C are defined in the following way: 




Fig. 3. Simulation of $r \in R_i$ in $\sigma'$.



(b) every splicing rule $r$ in $R_i$ is simulated by the communication rule
$(i, r, i)$ in $C$ (here we use the axioms from $A_i$), see Figure 3;



[Figure 4: test tubes $i$, $(i,j,m)$ and $j$; the communication rule A#Z$A# leads
from $i$ to $(i,j,m)$, and the communication rule #B$Z#B leads from $(i,j,m)$
to $j$.]

Fig. 4. Simulation of filter $\{A\}W^*\{B\}$ between test tubes $i$ and $j$ in $\sigma'$.

(c) every filter $f(i,j,m)$, $1 \le m \le n_{ij}$, $n_{ij} > 0$, of the form
$\{A\}W^*\{B\}$ between the tubes $i$ and $j$ is simulated by the communication
rules $(i,\ \underline{A\#Z}\,\$\,A\#,\ (i,j,m))$ and
$((i,j,m),\ \#B\,\$\,\underline{Z\#B},\ j)$ in $C$ (here we use the additional axioms
$AZ$ and $ZB$), see Figure 4.

The definitions given above completely specify the HTTCS $\sigma'$ generating the
same language as the given HTTS $\sigma$. □
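
The way two communication rules act as a filter $\{A\}W^*\{B\}$ can be checked on small strings. The sketch below assumes the usual splicing semantics ($x_1u_1v_1y_1$ and $x_2u_2v_2y_2$ yield $x_1u_1v_2y_2$) and single-character markers; `splice` and `pass_filter` are hypothetical helper names introduced only for this illustration.

```python
def splice(rule, x, y):
    """Apply u1#v1$u2#v2 to (x, y): x1 u1 v1 y1, x2 u2 v2 y2 -> x1 u1 v2 y2."""
    left, right = rule.split("$")
    u1, v1 = left.split("#")
    u2, v2 = right.split("#")
    results = set()
    for i in range(len(x) + 1):
        if x[:i].endswith(u1) and x[i:].startswith(v1):
            for j in range(len(y) + 1):
                if y[:j].endswith(u2) and y[j:].startswith(v2):
                    results.add(x[:i] + y[j:])
    return results


def pass_filter(word, a, b):
    """Send `word` through the two rules that simulate the filter {a}W*{b}.

    Step 1 uses the axiom aZ with the rule a#Z$a# (tube i to tube (i,j,m)),
    step 2 uses the axiom Zb with the rule #b$Z#b (tube (i,j,m) to tube j).
    """
    after_first = splice(f"{a}#Z${a}#", a + "Z", word)
    arrived = set()
    for w in after_first:
        arrived |= splice(f"#{b}$Z#{b}", w, "Z" + b)
    return arrived


if __name__ == "__main__":
    print(pass_filter("AabB", "A", "B"))  # word with the right end markers
    print(pass_filter("CabB", "A", "B"))  # wrong left marker
```

Both rules leave a matching string unchanged, so the only effect of the two steps is the move from tube $i$ via $(i,j,m)$ to tube $j$; a string with a different left or right marker cannot take part in the corresponding step.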

Theorem 2. For every HTTCS we can construct an equivalent HTTS.

Proof. We have to guarantee that in the HTTS every splicing rule is applied 
only once when simulating the splicing rule used at a communication step in the 
HTTCS (see Figure 5). 




[Figure 5: test tube $i$ connected to test tube $j$ by the communication rule $r$.]

Fig. 5. Communication rule in HTTCS.



For this simulation, we have to consider two cases: 



1. the splicing rule $r$ in the communication rule $(i,r,j)$ is of the form
$r = (\underline{Au_1\#v_1B}\,\$\,Cu_2\#v_2)$; then for every $X \in M$ we take two new
test tubes $((i,r,j),X,1)$ and $((i,r,j),X,2)$. In test tube $((i,r,j),X,1)$, the
splicing rule $r = (\underline{Au_1\#v_1B}\,\$\,Cu_2\#v_2)$ is applied using the axiom
$\mathit{ax} = Au_1v_1B$. The resulting end-marked string beginning with the marker
$A$ then can pass the filter $\{A\}W^*\{X\}$ from tube $((i,r,j),X,1)$ to tube
$((i,r,j),X,2)$, where the splicing rule $r' = (\underline{Au_1\#v_1B}\,\$\,Au_1\#)$ is
applied using the (original) axiom $\mathit{ax} = Au_1v_1B$. After that, the
resulting end-marked string can pass the filter $\{A\}W^*\{X\}$ to test tube $j$.

2. the splicing rule $r$ in the communication rule $(i,r,j)$ is of the form
$r = (u_1\#v_1D\,\$\,\underline{Au_2\#v_2B})$; then for every $X \in M$ we take two new
test tubes $((i,r,j),X,1)$ and $((i,r,j),X,2)$. In test tube $((i,r,j),X,1)$, the
splicing rule $r = (u_1\#v_1D\,\$\,\underline{Au_2\#v_2B})$ is applied using the axiom
$\mathit{ax} = Au_2v_2B$. The resulting end-marked string ending with the marker $B$
then can pass the filter $\{X\}W^*\{B\}$ from tube $((i,r,j),X,1)$ to tube
$((i,r,j),X,2)$, where the splicing rule $r' = (\#v_2B\,\$\,\underline{Au_2\#v_2B})$ is
applied using the (original) axiom $\mathit{ax} = Au_2v_2B$. After that, the
resulting end-marked string can pass the filter $\{X\}W^*\{B\}$ to test tube $j$.



From these constructions described above, it is obvious how to obtain a
complete description of the HTTS for a given HTTCS; the remaining details are
left to the reader. □

[Figure 6: test tube $i$, the new test tubes $((i,r,j),X,1)$ and $((i,r,j),X,2)$,
and test tube $j$, connected by filters of the form $\{C\}W^*\{X\}$ or
$\{X\}W^*\{D\}$, then $\{A\}W^*\{X\}$ or $\{X\}W^*\{B\}$, and again $\{A\}W^*\{X\}$ or
$\{X\}W^*\{B\}$.]

Fig. 6. Simulation of communication rule $(i,r,j)$ in HTTS.



As we can derive from the results proved in [7], the results proved in this
section show that test tube systems communicating by splicing have universal
computational power. In contrast to splicing test tube systems, where according
to the results proved in [7] systems with two test tubes are already universal,
we cannot bound the number of test tubes in the case of test tube systems
communicating by splicing.



4 Test Tube Systems Using Splicing Rules and Membrane 
Systems with Splicing Rules Assigned to Membranes 
Are Equivalent 



In this section we show that when equipped with splicing rules assigned to mem- 
branes, (sequential) P systems are equivalent to test tube systems using splicing 
rules. 



Theorem 3. For every HTTS we can construct an equivalent PSSRAM.

Proof. Let

$\sigma = (M'W^*M',\ M'_T W_T^* M'_T,\ A'_1, \ldots, A'_n,\ I'_1, \ldots, I'_n,\ R'_1, \ldots, R'_n,\ D)$

be an HTTS.

Then we construct an equivalent PSSRAM (see Figure 7)

$\Pi = (MW^*M,\ M_T W_T^* M_T,\ \mu,\ A_0, \ldots, A_{n+l},\ I_0, \ldots, I_{n+l},\ R_1, \ldots, R_{n+l})$



where 

1. $M = \{X_i \mid X \in M', 1 \le i \le n\} \cup \{Z\}$;
$M_T = \{X_i \mid X \in M'_T, 1 \le i \le n\}$;

2. for every test tube $i$ we use the membrane $i$ in $\Pi$; moreover, for every
filter $f(i,j,m) \in D$, $1 \le m \le n_{ij}$, $n_{ij} > 0$, of the form $\{A\}W^*\{B\}$
between the tubes $i$ and $j$ we take a membrane $(i,j,m)$ in $\Pi$; these additional
membranes altogether are $l$ membranes within the skin membrane;

3. $A_0 = \emptyset$, $A_i = \{C_i w D_i \mid CwD \in A'_i\}$, $1 \le i \le n$;

$A_{(i,j,m)} = \{A_jZ, ZB_j \mid f(i,j,m) = \{A\}W^*\{B\},\ (i, f(i,j,m), j) \in D\}$;

4. $I_0 = \{A_i w B_i \mid AwB \in I'_i, 1 \le i \le n\}$; $I_k = \emptyset$,
$1 \le k \le n + l$;

5. $R_i = \{(in, out;\ \underline{C_iu_1\#v_1D_i}\,\$\,A_iu_2\#v_2;\ out) \mid (\underline{Cu_1\#v_1D}\,\$\,Au_2\#v_2) \in R'_i\}$
$\cup\ \{(out, in;\ u_1\#v_1B_i\,\$\,\underline{C_iu_2\#v_2D_i};\ out) \mid (u_1\#v_1B\,\$\,\underline{Cu_2\#v_2D}) \in R'_i\}$,
$1 \le i \le n$;

moreover, for $i \ne j$ (which can be assumed without loss of generality) let
$f(i,j,m) = \{A\}W^*\{B\}$ be a filter between tubes $i$ and $j$; then we take

$R_{(i,j,m)} = \{(in, out;\ \underline{A_j\#Z}\,\$\,A_i\#;\ in),\ (in, in;\ \#B_i\,\$\,\underline{Z\#B_j};\ out)\}$.

[Figure 7: the skin membrane 0 with the initial objects $I_0$, enclosing a
membrane $i$ with the axioms $A_i$.]

Fig. 7. PSSRAM simulating HTTS.

The construction elaborated above proves that for every splicing test tube
system we can effectively construct an equivalent membrane system with splicing
rules assigned to membranes. □

Theorem 4. For every PSSRAM we can construct an equivalent HTTCS. 

Proof. Let

$\Pi = (MW^*M,\ M_T W_T^* M_T,\ \mu,\ A_0, \ldots, A_n,\ I_0, \ldots, I_n,\ R_1, \ldots, R_n)$

be a PSSRAM. Without loss of generality, we assume that every axiom needed
in the splicing rules of $R_i$, $1 \le i \le n$, is available in the corresponding
given sets of axioms $A_j$, $0 \le j \le n$. Then we can easily construct an
equivalent HTTCS

$\sigma = (MW^*M,\ M_T W_T^* M_T,\ A,\ I_0, \ldots, I_n,\ C)$

where $A = \bigcup_{i=0}^{n} A_i$ and the communication rules in $C$ are defined in the
following way:

Let $k$ be a membrane, $1 \le k \le n$ (remember that there are no rules assigned
to the skin membrane, which is labelled by 0), let the surrounding membrane be
labelled by $l$, and let $(or_1, or_2; r; tar)$ be a rule assigned to membrane $k$
for some splicing rule $r$. Then $(or_1, or_2; r; tar)$ can be simulated by a
communication rule $(i,r,j)$, where $j = k$ for $tar = in$ and $j = l$ for
$tar = out$ as well as

1. for $r$ being of the form $\underline{Au_1\#v_1B}\,\$\,u_2\#v_2$, $i = k$ for
$or_2 = in$ and $i = l$ for $or_2 = out$,

2. for $r$ being of the form $u_1\#v_1\,\$\,\underline{Au_2\#v_2B}$, $i = k$ for
$or_1 = in$ and $i = l$ for $or_1 = out$.

The construction elaborated above proves that for every membrane system
with splicing rules assigned to membranes we can effectively construct an
equivalent test tube system communicating by splicing. □
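
The index bookkeeping of this construction is compact enough to write down as a function. The following sketch is an illustration only: the function name, the argument names and the flag `axiom_first` (true when the axiom is the first string of the splicing rule, so that the origin of the other string is given by $or_2$) are assumptions introduced here, and the rule itself is passed through as an opaque value.

```python
def to_communication_rule(k, parent, or1, or2, rule, tar, axiom_first):
    """Translate a membrane rule (or1, or2; rule; tar) of membrane k.

    `parent` is the label of the membrane surrounding membrane k.
    Returns the communication rule (i, rule, j) of the simulating HTTCS.
    """
    # The target tube is region k for tar = in, the surrounding region otherwise.
    j = k if tar == "in" else parent
    # The source tube is the region holding the string that is not the axiom.
    origin = or2 if axiom_first else or1
    i = k if origin == "in" else parent
    return (i, rule, j)


if __name__ == "__main__":
    # Hypothetical rule of membrane 3 inside the skin membrane 0.
    print(to_communication_rule(3, 0, "in", "out", "Au1#v1B$u2#v2", "out", True))
```

Each membrane rule thus yields exactly one communication rule, so the HTTCS has as many communication rules as the PSSRAM has membrane rules.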

As the results in this section show, membrane systems with splicing rules
assigned to membranes are equivalent to the variants of test tube systems using
splicing rules that we considered in the previous section, and therefore they,
too, have universal computational power (also see [9]).

5 Summary and Related Results 

In this paper we have shown that splicing test tube systems and the new variant 
of test tube systems communicating by splicing are equivalent to (sequential) 
P systems using splicing rules assigned to membranes. Although using quite
restricted variants of splicing rules on end-marked strings, all the systems
considered in this paper have universal computational power, a result that
already follows from results obtained previously.

For splicing test tube systems, the results proved in [7] show that two test
tubes are enough for obtaining universal computational power, a result that is
optimal with respect to the number of tubes. For P systems using splicing rules
assigned to membranes, the results proved in [9] (there the environment played
the role of the region enclosed by the skin membrane) show that (with respect
to the definitions used in this paper) two membranes are enough, which again is
optimal with respect to the number of membranes. On the other hand,
the number of test tubes in the new variant of test tube systems communicating
by splicing cannot be bounded.



Abstract. Novel approaches to information encoding with DNA are
explored using a new Watson-Crick structure for binary strings more
appropriate to model DNA hybridization. First, a Gibbs energy analysis
of codeword sets is obtained by using a template and extant
error-correcting codes. Template-based codes have Gibbs energies that are
too low and thus allow cross-hybridization. Second, a new technique is
presented to construct arbitrarily large sets of noncrosshybridizing
codewords of high quality by two major criteria. They have a large minimum
number of mismatches between arbitrary pairs of words and alignments;
moreover, their pairwise Gibbs energies of hybridization remain bounded
within a safe region according to a modified nearest-neighbor model that has
been verified in vitro. The technique is scalable to long strands of up to
150-mers, is in principle implementable in vitro, and may be useful in
further combinatorial analysis of DNA structures. Finally, a novel method to
encode abiotic information in DNA arrays is defined and some preliminary
experimental results are discussed. These new methods can be regarded
as a different implementation of Tom Head’s idea of writing on DNA
molecules [22], although only through hybridization.



1 Introduction 

Virtually every application of DNA computing [23,1,17] requires the use of ap- 
propriate sequences to achieve intended hybridizations, reaction products, and 
yields. The codeword design problem [4,19,3] requires producing sets of strands 
that are likely to bind in desirable hybridizations while minimizing the probabil- 
ity of erroneous hybridizations that may induce false positive outcomes. A fairly 
extensive literature now exists on various aspects and approaches of the problem 
(see [4] for a review) . Approaches to this problem can be classified as evolution- 
ary [7,15,9] and conventional design [6,19]. Both types of method require the use 
of a measure of the quality of the codewords obtained, through either a fitness 
function or a quantifiable measure of successful outcomes in test tubes. 

Although some algorithms have been proposed for testing the quality of
codeword sets in terms of being free of secondary structure [4,10], very few
methods have been proposed to systematically produce codes of high enough
quality to guarantee good performance in test tube protocols. Other than greedy
“generate and filter” methods common in evolutionary algorithms [8], the only
systematic procedure to obtain code sets for DNA computing by analytic methods
is the template method developed in [2]. An application of the method requires
the use of a binary word, the so-called template, in combination with
error-correcting codes from information theory [26], and produces codeword set
designs with DNA molecules of size up to 32-mers (more below).

This paper explores novel methods for encoding information in DNA strands.
The obvious approach is to encode strings into DNA strands. They can be stored
or used so that DNA molecules can self-assemble fault-tolerantly for biomolecular
computation [9,11,19,18]. In Section 2, a binary analog of DNA is introduced as
a framework for discussing encoding problems. Section 3.2 describes a new
technique, analogous to the tensor product techniques used in quantum computing
for error-correcting codes [29], to produce appropriate methods to encode
information in DNA-like strands, and defines precisely what “appropriate” means.
It is also shown how these error-preventing codes for binary DNA (BNA, for short)
can be easily translated into codeword sets of comparable quality for DNA-based
computations. Furthermore, two independent evaluations are discussed of
the quality of these codes in ways directly related to their performance in test
tube reactions for computational purposes with DNA. We also compare them to
code sets obtained using the template method.

Direct encoding into DNA strands is not a very efficient method for stor- 
age or processing of massive amounts (over terabytes) of abiotic data because 
of the enormous implicit cost of DNA synthesis to produce the encoding se- 
quences. A more indirect and more efficient approach is described in Section 4. 
Assuming the existence of a large basis of noncrosshybridizing DNA molecules, 
as obtained above, theoretical and experimental results are presented that allow
a preliminary assessment of the reliability and potential capacity of this method.
These new methods can be regarded as a different implementation of Tom Head’s 
idea of aqueous computing for writing on DNA molecules [22,21], although only 
hybridization is involved. Section 5 summarizes the results and presents some 
preliminary conclusions about the technical feasibility of these methods. 

2 Binary Models of DNA 

DNA molecules can only process information by intermolecular reactions,
usually hybridization in DNA-based computing. Due to the inherent uncertainty
in biochemical processes, small variations in strand composition will not cause
major changes in hybridization events, with consequent limitations on using
similar molecules to encode different inputs. Input strands must be “far apart”
from each other in hybridization affinity in order to ensure that only desirable
hybridizations occur. The major difficulty is that the hybridization affinity
between DNA strands is hard to quantify. Ideally, the Gibbs energy released in
the process is the most appropriate criterion, but its exact calculation is
difficult, even for pairwise interactions among small oligos (up to 60-mers)
and even when using approximation models [8]. Hence an exhaustive search of
strand sets of words maximally separated in a given coding space is infeasible,
even for the small oligonucleotides useful in DNA computing.

To cope with this problem, a much simpler and computationally tractable
model, the h-distance, was introduced in [18]. Here, we show how an abstraction
of this concept to include binary strings can be used to produce code sets
of high enough quality for use in vitro, i.e., how to encode symbolic informa- 
tion in biomolecules for robust and fault-tolerant computing. By introducing a 
DNA-like structure into binary data, one expects to introduce good properties 
of DNA into electronic information processing while retaining good properties 
of electronic data storage and processing (such as programmability and reliabil- 
ity). In this section it is shown how to associate basic DNA-like properties with 
binary strings, and in the remaining sections we show how to use good coding 
sets in binary with respect to this structure to obtain good codeword sets for 
computing with biomolecules in vitro. 



2.1 Binary DNA 

Basic features of DNA structure can be brought into data representations
traditionally used in conventional computing as follows. Information is usually
represented in binary strings, but here they are treated as analogs of DNA
strands and referred to as binary oligomers, or simply biners. The pseudo-base
(or abstract base) 0 binds with 1, and vice versa, to create a complementary
pair 0/1, in analogy with Watson-Crick bonding of natural nucleic bases. These
concepts can be extended to biners recursively as follows:

$(xa)^R := a^R x^R$, and $(xa)^{wc} := a^{wc} x^{wc}$,   (1)

where the superscripts $R$ and $wc$ stand for the reversal operation and the
Watson-Crick complementary operation, respectively. The resulting single and
double strands are referred to as binary DNA, or simply BNA. Hybridization of
single DNA strands is of crucial importance in biomolecular computing and it
thus needs to be defined properly for BNA. The motivation is, of course, the
analogous ensemble processes with DNA strands [30].
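
A minimal sketch of the two operations on biners, assuming biners are represented as Python strings over the characters 0 and 1; the function names `reverse` and `wc` are chosen here for illustration.

```python
COMPLEMENT = {"0": "1", "1": "0"}


def reverse(x):
    """Reversal x^R of a biner given as a string of 0s and 1s."""
    return x[::-1]


def wc(x):
    """Watson-Crick complement x^wc: reverse x and exchange 0s and 1s."""
    return "".join(COMPLEMENT[bit] for bit in reversed(x))


if __name__ == "__main__":
    for biner in ("000", "011", "0011"):
        print(biner, reverse(biner), wc(biner))
```

Applying `wc` twice returns the original biner, and a biner such as 0011 is its own Watson-Crick complement.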

2.2 h-Distance 

A measure of hybridization likelihood between two DNA strands has been
introduced in [18]. A similar concept can be used in the context of BNA. The
h-measure between two biners $x, y$ of length $n$ is given by

$|x, y| := \min_{-n < k < n} H(x, \sigma^k(y^{wc}))$,   (2)

where $\sigma^k$ is the (right-) left-shift by $k$ positions (if $k < 0$, resp.),
$y^{wc}$ is the Watson-Crick complement of $y$ obtained by reversing $y$ and
exchanging 0s and 1s, and $H(*, *)$ is the ordinary Hamming distance between
binary strands. This h-measure takes the minimum of all Hamming distances
obtained by successively shifting and lining up the reverse complement of $y$
against $x$. If some shift of one strand perfectly matches a segment of the
other, the measure is reduced by the length of the matching segment. Thus a
small measure indicates that the two biners are likely to stick to each other
one way or another. Measure 0 indicates perfect complementarity. A large measure
indicates that in whatever relative position $x$ finds itself in the proximity
of $y$, they are far from containing a large number of complementary base pairs,
and are therefore less likely to hybridize, i.e., they are more likely to avoid
an error (unwanted hybridization). Strictly speaking, the h-measure is not a
distance function in the mathematical sense, because it does not obey the usual
properties of ordinary distances (e.g., the distance between different strands
may be 0 and, as a consequence, the triangle inequality fails). One obtains the
h-distance, however, by grouping together complementary biners and passing to
the quotient space. These pairs will continue to be referred to as biners. The
distance between two such biners $X, Y$ is now given by the minimum h-measure
between all four possible pairs $|x, y|$, one $x$ from $X$ and one $y$ from $Y$.
There is the added advantage that in choosing a strand for encoding, its
Watson-Crick complement is also chosen, which goes well with the fact that the
complement is usually needed for the required hybridization reactions.
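
The h-measure and the h-distance can be computed directly from these definitions. The sketch below rests on one assumption that the text does not spell out: positions left without a partner after a shift are counted as mismatches, which is what makes measure 0 coincide with perfect complementarity. The function names are illustrative.

```python
def wc(y):
    """Watson-Crick complement: reverse y and exchange 0s and 1s."""
    return "".join("1" if bit == "0" else "0" for bit in reversed(y))


def h_measure(x, y):
    """Minimum over all shifts k of the mismatches between x and shifted y^wc.

    Assumption: a position of x that faces no symbol after the shift
    counts as a mismatch.
    """
    if len(x) != len(y):
        raise ValueError("biners must have the same length")
    n = len(x)
    target = wc(y)
    best = n
    for k in range(-(n - 1), n):
        mismatches = 0
        for i in range(n):
            j = i + k
            if j < 0 or j >= n or x[i] != target[j]:
                mismatches += 1
        best = min(best, mismatches)
    return best


def h_distance(x, y):
    """Minimum h-measure over the four pairs taken from {x, x^wc} and {y, y^wc}."""
    return min(h_measure(p, q) for p in (x, wc(x)) for q in (y, wc(y)))


if __name__ == "__main__":
    print(h_measure("0000", "1111"), h_measure("0000", "0000"))
    print(h_distance("10000", "10100"))
```

Under these assumptions, a biner and its complement have h-measure 0, while `h_distance("10000", "10100")` evaluates to 1.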

One can use the h-distance to quantify the quality of a code set of biners.
An optimal encoding maximizes the minimum h-distance among all pairs
of its biners. For applications in vitro, a given set of reaction conditions is
abstracted as a parameter $\tau > 0$ giving a bound on the maximum number of
base pairs that are permitted to hybridize between two biners before an error
(undesirable hybridization) is introduced in the computation. The analogy
between the problem of finding good codes for DNA-based computing and that of
finding good error-correcting codes in information theory was pointed out in
[18,9]. The theory of error-correcting codes based on the ordinary Hamming
distance is a well established field in information theory [26]. A
$t$-error-correcting code [26] is a set of binary strands with a minimum Hamming
distance $2t + 1$ between any pair of codewords. A good encoding set of biners
has an analogous property. The difficulty with biomolecules is that the
hybridization affinity, even as abstracted by the h-distance, is essentially
different from the Hamming distance, and so the plethora of known
error-correcting codes [26] does not translate immediately into good DNA code
sets. New techniques are required to handle the essentially different
situation, as further evidenced below.
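
For the classical side of this analogy, the relation between minimum Hamming distance and error-correcting capability is easy to compute. The sketch below uses a small made-up code as input; it is not one of the codes used in the paper, and the function names are illustrative.

```python
from itertools import combinations


def hamming(u, v):
    """Ordinary Hamming distance between two binary words of equal length."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(u, v))


def error_correcting_capability(code):
    """Return (d, t): minimum pairwise distance d and the largest t with 2t + 1 <= d."""
    d = min(hamming(u, v) for u, v in combinations(code, 2))
    return d, (d - 1) // 2


if __name__ == "__main__":
    toy_code = ["00000", "11100", "00111", "11011"]  # illustrative example only
    print(error_correcting_capability(toy_code))
```

The same pairwise-minimum pattern applies to a set of biners once the Hamming distance is replaced by the h-distance.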

In this framework, the codeword design problem becomes that of producing
encoding sets that maintain the integrity of information while
enabling only the desirable interactions between appropriate biners. Using these
codes does not really require the usual procedure of encoding and decoding
information with error-correcting codes since there are no errors to correct. In
fact, there is no natural way to enforce the redundancy equations characteristic
of error-correcting codes in vitro. This is not the only basic difference with
information theory. All bits in code sets are information bits, so the codes have
rate 1, and are therefore more efficient for a given encoding length. The result of
the computation only needs to be re-interpreted in terms of the assigned mapping
used to transduce the original problem’s input into DNA strands. Therefore,
given the nature of the inspiring biochemistry, analogous codes for hybridization
distance are more appropriately called error-preventing codes, because the
minimum separation in terms of hybridization distance prevents errors, instead of
enabling error detection and correction once they have occurred.

2.3 Gibbs Energy 

Ideally, the Gibbs energy released in the hybridization process between strand 
pairs is the most appropriate criterion of quality for a code set for experiments in 
vitro. Although hybridization reactions in vitro are governed by well established 
rules of local interaction between base pairings, difficulties arise in trying to 
extend these rules to even short oligonucleotides (less than 150-mers) in a variety 
of conditions [28] . Hence an exhaustive search of strand sets of words maximally 
separated in a given coding space is infeasible, even for the small size of oligo- 
nucleotides useful in DNA computing. 

Computation of the Gibbs energy thus relies on approximations based on 
various assumptions about the type of interactions between neighboring bonds. 
Various models have been proposed for the range of oligonucleotides used in 
DNA-based computing, major among which are the nearest-neighbor model and 
the staggered-zipper model [28]. We use an extension of the nearest-neighbor
model proposed by [8] that computes optimal alignments between DNA
oligonucleotides using a dynamic programming algorithm. There is evidence that
this calculation of the Gibbs energy, while not as computationally efficient as
the h-distance, is a good predictor of actual hybridization in the range of 20-
to 60-mers in PCR reactions in vitro [5].

3 Error-Preventing Codes 

Code designs can now be described based on the tensor product operation and
compared with codes obtained using the template method [2]. Both types of
codes are obtained by first finding a good biner set (a template or a seed set),
which is then used to generate a set of BNA strings. These codes are then
transformed into code sets for DNA computation in real test tubes, so the basic
question becomes how they map to codeword sets in DNA and how their quality
(as measured by the parameter $\tau$) maps to the corresponding reaction conditions
and DNA ensemble behavior.

3.1 The Template Method 

The template method requires the use of a template $n$-biner $T$ which is as far
from itself as possible in terms of h-measure, i.e., $|T, T| \gg 0$, and an
error-correcting code $C$ with words of length $n$. The bits are mapped to DNA
words by grouping (see below) DNA bases into two groups, the 0-group (say a, t)
and the 1-group (say c, g), corresponding to the bits of $T$, and then rewriting
the codewords in $C$ so that their bits are interpreted as one (a or c) or the
other (g or t). Codewords are generated in such a way that the distances would
be essentially preserved (in proportion of 1 : 1), as shown in Table 1(b).
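
The mapping just described can be sketched in a few lines. The reading assumed here is that the template bit selects the group (0 for a/t, 1 for c/g) and the codeword bit selects the base within the group (0 for a or c, 1 for t or g); the template and codeword in the example are arbitrary illustrative values, and the names are not taken from [2].

```python
# Template bit selects the group; codeword bit selects the base inside it.
GROUPS = {"0": ("a", "t"), "1": ("c", "g")}


def template_word(template, codeword):
    """Combine a template biner and a binary codeword into one DNA word."""
    if len(template) != len(codeword):
        raise ValueError("template and codeword must have the same length")
    return "".join(GROUPS[t][int(c)] for t, c in zip(template, codeword))


if __name__ == "__main__":
    print(template_word("0110", "0101"))
```

With a fixed template, two codewords that differ in a given bit produce DNA words that differ in the corresponding base, which is the sense in which Hamming distances carry over position by position.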

The quality of the code set produced naturally depends on the quality of the
self-distance of $T$ and the error-correcting capability of $C$. If a minimum
self-distance is assured, the method can be used to produce codewords with
appropriate c/g content to have nearly uniform melting temperatures. In [2], an
exhaustive search was conducted of templates of length $l \le 32$ and templates
with minimum self-distance about $l/3$. However, the thermodynamical analysis and
lab performance are still open problems. To produce a point of comparison, the
template method was used, although with a different seed code (32-bit BCH codes
[26]). The pairwise Gibbs energy profiles of the series of template codeword
sets obtained using some templates listed in [2] are shown below in Fig. 2(a).



3.2 Tensor Products 

A new iterative technique is now introduced to systematically construct large
code sets with high values for the minimum h-distance $\tau$ between codewords.
The technique requires a base code of short biners (a good set of short oligos)
to seed the process. It is based on a new operation between strands, called the
tensor product between words from two given codes of $s$- and $t$-biners, to
produce a code of $(s + t)$-biners. The codewords are produced as follows. Given
two biners $x = ab$ and $y$ from two coding sets $C$, $D$ of $s$- and $t$-biners,
respectively, where $a, b$ are the corresponding halves of $x$, new codewords of
length $s + t$ are given by $x \otimes y = a'y'b'$, where the prime indicates a
cyclic permutation of the word obtained by bringing one or a few of the last
symbols of each word to the front. The tensor product of two sets $C \otimes D$ is
the set of all such products between pairs of words $x$ from $C$ and $y$ from $D$,
where the cyclic permutation is performed once again on the previously used
word so that no word from $C$ or $D$ is used more than once in the concatenation
$a'y'b'$. The product code $C \otimes D$ therefore contains at least $|C||D|$
codewords. In one application below, the construction is used without applying
cyclic permutations to the factor codewords, with comparable, if lesser,
results.
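
A sketch of the basic product step is given below. Several details are assumptions made for illustration: odd-length words are split with the shorter half first, the number of rotated symbols is exposed as a hypothetical parameter `r` (with `r = 0` giving the variant without cyclic permutations), and the repeated re-permutation of already used words is not modelled.

```python
def rotate(word, r):
    """Cyclic permutation bringing the last r symbols of word to the front."""
    if not word:
        return word
    r %= len(word)
    return word[-r:] + word[:-r] if r else word


def tensor_word(x, y, r=0):
    """x (x) y = a' y' b', where x = ab is split into halves a and b."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return rotate(a, r) + rotate(y, r) + rotate(b, r)


def tensor_code(code_c, code_d, r=0):
    """All products of a word from code_c with a word from code_d."""
    return [tensor_word(x, y, r) for x in code_c for y in code_d]


if __name__ == "__main__":
    seed5 = ["10000", "10100", "11000"]
    seed6 = ["100010", "100100", "110000"]
    product = tensor_code(seed5, seed6)
    print(len(product), product[0])
```

The resulting words have length $s + t$, and the list has $|C||D|$ entries; any pruning by h-distance would be a separate step applied to this list.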



Size/r 


Best codes 


5/1 

6/2 

7/2 

8/2 


10000,10100,11000 
100010.100100,110000 
1000010,1001000,1100000 
11000100, 1 1010000, 1 1 100000 



BNA 


DNA base 


000,010/111,101 

011,100/001,110 


a/t 

c/g 



Table 1. (a) Best seed codes, and (b) mapping to convert BNA to a DNA strand. 

To seed the process, we conducted exhaustive searches for the best possible
codes of 5- and 6-biners. The results are shown in Table 1(a). No 3-element sets
of 7- or 8-biners exist with h-distance r > 2. It is not hard to see that if n, m
are the minimum h-distances between any two codewords in C and D, respectively,
then the minimum distance between codewords in $C \otimes D$ may well be below
n + m. By taking the two factors C = D to be equal to the same seed n-biner
code C from Table 1, we obtain a code D'. The operation is iterated to obtain
codes of 2n-, 4n-, 8n-biners, etc. Because of the cyclic shifts, the new
codewords in D' are not necessarily at distance s + s = 2s, double the minimum
distance s between words in the original code. We thus further pruned the biners
whose h-distances to others are lowest and most common, until the minimum
distance between the codewords that remain is at least 2s. In about a third of
the seeds, at least one code set survives the procedure and systematically
produces codes of high minimum h-distance between codewords. Since the size of
the code set grows exponentially, we can obtain large code sets of relatively
short length by starting with seeds of small size. Fig. 1(a) shows the growth of
the minimum h-distance over the first few iterations of the product with the
seeds shown in Table 1(a) in various combinations. We will discuss the tube
performance of these codes in the following section, after showing how to use
them to produce codeword sets of DNA strands.
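
The pruning step can be pictured as a greedy loop. The sketch below is one
possible reading, not the authors' procedure: it repeatedly removes the word
that takes part in the most pairs at the current lowest distance. The distance
is a pluggable parameter, and the Hamming distance used by default is only a
stand-in for the h-distance, which is defined earlier in the paper.

```python
from collections import Counter
from itertools import combinations
from typing import Callable


def hamming(u: str, v: str) -> int:
    """Stand-in distance for the demo; the paper uses the h-distance instead."""
    return sum(p != q for p, q in zip(u, v))


def prune(code: list[str], target: int,
          dist: Callable[[str, str], int] = hamming) -> list[str]:
    """Drop words until every remaining pair is at distance >= target."""
    words = list(code)
    while len(words) > 1:
        pairs = [(dist(u, v), u, v) for u, v in combinations(words, 2)]
        lowest = min(d for d, _, _ in pairs)
        if lowest >= target:
            break
        # Count how often each word takes part in a pair at the lowest distance.
        blame = Counter()
        for d, u, v in pairs:
            if d == lowest:
                blame.update((u, v))
        worst = max(words, key=lambda w: blame[w])  # ties: first in list order
        words.remove(worst)
    return words


survivors = prune(["0000", "0001", "0011", "1111"], target=2)
print(survivors)
```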




[Figure 1: (a) plot against "Length of Biners" with Min, Avg and Max curves;
(b) diagram of the two conversion methods.]

Fig. 1. (a) Minimum h-distance in tensor product BNA sets; and (b) two methods
to convert biners to DNA oligonucleotides.



3.3 From BNA to DNA

Two different methods were used. The first method (grouping) associates to 
every one of the eight 3-biners a DNA base a, c, g, or t as shown in Table 1(b). 
Note that Watson-Crick complementarity is preserved, so that the hybridization 
distances should be roughly preserved (in proportion 3:1). 
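
The grouping method can be sketched as a lookup table. The mapping below is one
reading of Table 1(b), taking the first group of each row to the first base of
the pair (000, 010 to a; 111, 101 to t; 011, 100 to c; 001, 110 to g). It also
assumes that the complement of a biner is its reversed, bit-flipped copy, which
is the reading under which the table preserves Watson-Crick complementarity.
Function names are hypothetical.

```python
GROUPING = {
    "000": "a", "010": "a",
    "111": "t", "101": "t",
    "011": "c", "100": "c",
    "001": "g", "110": "g",
}
WC = {"a": "t", "t": "a", "c": "g", "g": "c"}


def biner_to_dna(biner: str) -> str:
    """Map each consecutive 3-biner to one DNA base (grouping method)."""
    if len(biner) % 3:
        raise ValueError("biner length must be a multiple of 3")
    return "".join(GROUPING[biner[i:i + 3]] for i in range(0, len(biner), 3))


def biner_revcomp(biner: str) -> str:
    """Reverse the biner and flip every bit (assumed notion of complement)."""
    return "".join("1" if b == "0" else "0" for b in reversed(biner))


def dna_revcomp(strand: str) -> str:
    """Watson-Crick reverse complement of a DNA strand."""
    return "".join(WC[b] for b in reversed(strand))


x = "000011110"
print(biner_to_dna(x), biner_to_dna(biner_revcomp(x)))
```

Under these assumptions, mapping the complement of a biner gives the
Watson-Crick complement of the mapped strand, with three bits per base.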

The second method (position) assigns to each position in a biner set a fixed
DNA base a, c, g, or t. The assignment is in principle arbitrary, and in
particular can be done so that the composition of the strands follows a
predetermined pattern (e.g., 60% c/g's and 40% a/t's). All biners are then
uniformly mapped to a DNA codeword by the same assignment. The resulting code
thus has minimum h-distance between DNA codewords no smaller than in the
original BNA seed. Fig. 2(b) shows the quality of the codes obtained by this
method, to be discussed next.



3.4 Evaluation of Code Quality 

For biner codes, a first criterion is the distribution of h-distance between
codewords. Fig. 1(a) shows this distribution for the various codes generated by
tensor products. The minimum pairwise distance is, on average, about 33% of the
strand length. This is comparable with the minimum distance of 1/3 in template
codes.

As mentioned in [2], however, the primary criterion of quality for codeword
sets is performance in the test tube. A reliable predictor of such performance
is the Gibbs energy of hybridization. A computational method to compute the
Gibbs energy valid for short oligonucleotides (up to 60 bps) is given in [8].
Code sets obtained by generate-and-filter methods using this model have been
tested in the tube with good experimental results [8,10]. The same Gibbs energy
calculation was used to evaluate and compare the quality of code sets generated
by the template and tensor product methods. According to this method, the
highest Gibbs energy allowable between two codewords must be, at the very
least, -6 kcal/mol if they are to remain noncrosshybridizing [8].

Statistics on the Gibbs energy of hybridization are shown in Fig. 2. The
template codes of very short lengths in Fig. 2(a) remain within a safe region
(above -6 kcal/mol), but for longer lengths their Gibbs energies are too low and
permit crosshybridization. On the other hand, although the tensor product
operation initially decreases the Gibbs energy, the Gibbs energies eventually
stabilize and remain bounded within a safe region that does not permit
crosshybridization, for both the average and the minimum energies.

Fig. 3 shows a more detailed comparison of the frequency of codewords in
each of the two types of codes with a Gibbs energy prone to crosshybridization
(third bar from top). Over 80% of the pairwise energies are noncrosshybridizing
for template codes (top and middle bars), but they dip below the limit in
a significant proportion. On the other hand, all pairwise energies predict no
crosshybridization among tensor product codewords, which also exhibit more
positive energies. Finally, the number of codewords produced is larger with
tensor products, and increases rapidly when the length of the codewords reaches
64- and 128-mers (for which no template codes are available).

4 Encoding Abiotic Information in DNA Spaces 

This section explores the broader problem of encoding abiotic information in
DNA for storage and processing. The methods presented in the previous section
could possibly be used to encode symbolic information (strings) in DNA single
strands. A fundamental problem is that the abiotic nature of the information
would appear to require massive synthesis of DNA strands proportional to the
amount to be encoded. Current methods produce massive amounts of DNA copies of
the same species, but not of too many different species. Here, some theoretical
foundation and experimental results are provided for a new method, described
below. This method can be regarded as a new implementation of Tom Head's idea of
aqueous computing for writing on DNA molecules [22,?], although through simpler
operations (only hybridization).

Fig. 2. Gibbs energies of (a = above) template codes; and (b = below) tensor
product codes.

4.1 A Representation Using a Non-crosshybridizing Basis 

[Figure 3: bar chart for tensor product and template codes; horizontal axis:
log base 2 of the codeword length; legend: less than -6, between -6 and 0,
greater than or equal to 0.]

Fig. 3. Code size in error-preventing DNA codes. Also shown are the percentages
of pairwise Gibbs energies in three regions for each code (crosshybridizing:
third bar from top, and noncrosshybridizing: top two).

Let B be a set of DNA molecules (the encoding basis, or "stations" in Head's
terminology [22], here not necessarily bistable), which is assumed to be finite
and noncrosshybridizing according to some parameter r (for example, the Gibbs
energy, or the h-distance mentioned above). For simplicity, it is also assumed
that the length of the strands in B is a fixed integer n, and that B contains no
hairpins. If r = 0 and the h-distance is the hybridization criterion, a maximal
such set B can be obtained by selecting one strand from every (non-palindromic)
pair of Watson-Crick complementary strands. If r = n, a maximal set B consists
of only one strand of length n, to which every other strand will hybridize under
the mildest hybridization condition represented by such large r. Let N = |B| be
the cardinality of B. Without loss of generality, it will be assumed that
strings to be encoded are written in a four-letter alphabet {a, c, g, t}.
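
For small n, the r = 0 construction can be enumerated directly. The sketch below
assumes that a "pair of Watson-Crick complementary strands" means a strand
together with its reverse complement, and that a palindromic strand is one equal
to its own reverse complement; the function names are hypothetical.

```python
from itertools import product

WC = {"a": "t", "t": "a", "c": "g", "g": "c"}


def revcomp(strand: str) -> str:
    """Watson-Crick reverse complement."""
    return "".join(WC[b] for b in reversed(strand))


def maximal_basis_r0(n: int) -> list[str]:
    """Pick one strand from every non-palindromic complementary pair of n-mers."""
    basis = []
    for letters in product("acgt", repeat=n):
        s = "".join(letters)
        # Keep the lexicographically smaller member of each pair; a strand
        # equal to its own reverse complement (palindromic) is skipped.
        if s < revcomp(s):
            basis.append(s)
    return basis


for n in (2, 3):
    print(n, len(maximal_basis_r0(n)))
```

Under this reading the basis holds half of the non-palindromic n-mers, so its
size grows exponentially with n.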

Given a string x (ordinarily much larger than n), x is said to be h-dependent
on B if there is some concatenation c of elements of B such that x will
hybridize to c under stringency r, i.e., |c, x| < r. Shredding x into the
fragments corresponding to the components of c in B leads to the following
slightly weaker but more manageable definition. The signature of x with respect
to B is a vector V of size N obtained as follows. Shredding x into |x|/n
fragments of size n or less, V_i is the number of fragments f of x that are
within threshold r of strand i in B, i.e., such that |f, i| < r/2. The vector V
can be visualized as a 1D signature, or as the 2D matrix shown in Fig. 4(a),
where the basis strands have been arranged as they would be on a 2D DNA chip,
with one basis strand per spot.
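
The signature can be sketched as a counting loop over fragments and basis
strands. Two simplifications are assumed here and are not taken from the text:
the hybridization measure |f, i| is replaced by a stand-in (the number of
mismatches between a fragment and the reverse complement of the basis strand,
with missing positions counted as mismatches), and the cutoff is passed in as a
plain parameter named threshold. All names are hypothetical.

```python
WC = {"a": "t", "t": "a", "c": "g", "g": "c"}


def revcomp(strand: str) -> str:
    """Watson-Crick reverse complement."""
    return "".join(WC[b] for b in reversed(strand))


def shred(x: str, n: int) -> list[str]:
    """Cut x into consecutive fragments of size n (the last may be shorter)."""
    return [x[i:i + n] for i in range(0, len(x), n)]


def mismatches(fragment: str, probe: str) -> int:
    """Stand-in for |f, i|: mismatches against the probe's reverse complement."""
    target = revcomp(probe)
    differing = sum(p != q for p, q in zip(fragment, target))
    return differing + abs(len(fragment) - len(target))


def signature(x: str, basis: list[str], threshold: int) -> list[int]:
    """V[i] counts the fragments of x within the threshold of basis strand i."""
    n = len(basis[0])
    fragments = shred(x, n)
    return [sum(mismatches(f, probe) < threshold for f in fragments)
            for probe in basis]


basis = ["aacc", "acac"]
x = "ggtt" + "gtgt" + "ggtt" + "gg"
V = signature(x, basis, threshold=1)
print(V)
```

Laying V out row by row over the spots of a chip gives the 2D picture described
in the text.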

Under realistic reaction conditions, the vector V may appear not to be
well-defined, since it is clear that its calculation depends on the
concentrations of the basis strands as well as of the strands x used in an
experiment to compute it. To avoid these difficulties, this situation will be
idealized to the case where the same number of copies (say, one copy) of each
strand is present in the tube, and the tube is small enough that all possible
hybridizations occur within reasonable time. This idealization is based on the
results of the following experiments. The experiments were performed in
simulation on a virtual test tube that is known to provide highly reliable
predictions of wet tube reactions (see [13] for more details).

Six large plasmids of lengths varying between 3 and 4.5 Kbps were shredded
into fragments of size 40 or less and thrown into a tube containing a basis B of
three different qualities (values of r) consisting of 36-mers. The first set,
Nxh, consisted of non-crosshybridizing 40-mers, the length of the longest
fragment in the shredded plasmids x; the second set, Randxh, was a randomly
chosen set of 40-mers; and the third set, Mxh, was a maximally crosshybridizing
set of 40-mers consisting of 18 copies of the same strand and 18 copies of its
Watson-Crick complement. The hybridization reactions were determined by the
h-distance for various thresholds r between 0% and 50% of the basis strand
length n, which took the values n = 10, 20, 40. Each experiment was repeated 10
times under identical conditions. The typical signatures of the six plasmids are
illustrated in Fig. 4, averaged pixelwise for one of them, together with the
corresponding standard deviation. Fig. 5(a) shows the signal-to-noise ratio
(SNR) in effect size units, i.e., given by the pixel signals divided by the
standard deviation for the same plasmid, and Fig. 5(b) shows the chipwise SNR
comparison for all plasmids, in the same scale as shown in Fig. 4.
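
The pixelwise SNR in effect size units can be computed as below. This is a
sketch under assumptions: the "pixel signal" is taken to be the mean over the
repeated runs, the noise is the sample standard deviation over the same runs,
and a pixel with zero deviation is reported as infinite SNR. The function name
and the toy data are hypothetical.

```python
from statistics import mean, stdev


def pixelwise_snr(runs: list[list[float]]) -> list[float]:
    """For each pixel, divide the mean signal over runs by its standard deviation."""
    snr = []
    for pixel in zip(*runs):  # one tuple of repeated readings per pixel
        sd = stdev(pixel)
        snr.append(mean(pixel) / sd if sd > 0 else float("inf"))
    return snr


# Three toy runs of a two-pixel signature (made-up numbers for illustration).
runs = [[4, 10], [6, 10], [5, 10]]
print(pixelwise_snr(runs))
```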







Fig. 4. Signatures of a plasmid genome in (a) one run (left); (b) average
over 10 runs; and (c) noise (right, as standard deviation), on a set of
non-crosshybridizing probes.



Examination of these results indicates that the signal obtained from the
Mxh set is the weakest of all, producing results indistinguishable from noise,
certainly due to the random wandering of the molecules in the test tube and
their encountering different basis strands (probes). This is in clear contrast
with the signals provided by the other sets, where the noise can be shown to be
about 10% of the entire signal. The random set Randxh shows signals in between,
and the Nxh set shows the strongest signal. The conclusion is that, under the
assumptions leading to the definition above, an appropriate basis set and a
corresponding hybridization stringency do provide an appropriate definition of
the signature of a given set.

Fig. 5. Signal-to-noise ratio (a) pixelwise and (b) chipwise for all 6 plasmids.



4.2 The Dimension of DNA Spaces 

Under the conditions of the definition of h-dependence, the signature of a
single basis strand is a delta function (a single pixel), while it is perfectly
possible that a strand h-independent from B may have a signal-less signature.
Thus there remain two questions. First, how to ensure the completeness of
signatures for arbitrary strings x (of length n or larger) on a basis B of given
n-mers, i.e., to ensure that the original strand x can be somewhat reconstructed
from its signature V. And second, how to ensure universal representation for
every strand, at least of length n. These problems are similar to, although
distinct from, T. Head's Common Algorithmic Problem (CAP) [22], which calls for
a subset of largest cardinality that fails to contain every one of the subsets
in a given family of forbidden subsets of the DNA space of n-mers. Here, every
strand $i \in B$ and a given r determine a subset of strands $S_i$ that
hybridize to i under stringency r. We now want the smallest B that maximizes the
"intersection" with every subset $S_x$ in order to provide the strongest signal
to represent an input strand x. (An instance of CAP with the complementary sets
only asks for the smallest cardinality subset that just "hits", with a
nontrivial intersection, all the permissible sets, i.e., the complements of the
forbidden subsets.)

It is clear that completeness is ensured by choosing a maximal set of
h-independent strands for which every strand of length n provides a nonempty
signature. Such a set exists assuming that the first problem is solved. The
h-dimension of a set of n-mer strands is defined as the size of a complete basis
for the space. The computation of the dimension of the full set of n-mers is
thus tightly connected to its structure in terms of hybridization, and it
appears to be a difficult problem.

The first problem is more difficult to formulate. It is clear that if one
insists on being able to recover the composition of strand x from its signature
V(x), the threshold must be set to r = 0 and the solution becomes very simple,
if exceedingly large (exponential in the basis strand length). However, if one
is more interested in the possibility of storing and processing information in
terms of DNA chip signatures, the question arises whether the signature is a
good enough representation to process x from the point of view of higher-level
semantic relationships. Evidence in [14] shows that there is a real possibility
that signatures for relatively short bases are complete for simple
discriminating tasks that may suffice for many applications.



5 Summary and Conclusions 

A new approach (tensor product) has been introduced for the systematic
generation of sets of codewords to encode information in DNA strands for
computational protocols so that undesirable crosshybridizations are prevented.
This is a hybrid method that combines analytic techniques with a combinatorial
fitness function (h-distance [18]) based on an abstract model of complementarity
in binary strings. A new way to represent digital information has also been
introduced using such sets of noncrosshybridizing codewords (possibly in
solution, or affixed to a DNA chip), and some properties of these
representations have been explored.

An analysis of the energetics of codeword sets obtained by this method shows
that their thermodynamic properties are good enough for acceptable performance
in the test tube, as determined by the Gibbs energy model of Deaton et al. [8].
These results also confirm that the h-distance is not only a computationally
tractable model, but also a reliable model for codeword design and analysis. The
final judge of the quality of these code sets is, of course, the performance in
the test tube. Preliminary evidence in vitro [5] shows that these codes are
likely to perform well in wet tubes.

A related measure of quality is what can be termed the coverage of the code,
i.e., how well spread the code is over the entire DNA space to provide
representation for every strand. This measure of quality is clearly inversely
related to the error-preventing quality of the code. It is also directly related
to the capacity of DNA of a given length to encode information, which is given
by a quantity that could be termed the dimension of the space of DNA strands of
length n, although properties analogous to the concept of dimension in Euclidean
spaces are yet to be assessed. Finally, determining how close tensor product
codes come to optimal size is also a question worthy of further study.


Abstract. Recent research has considered DNA as a medium for ultra-scale
computation and for ultra-compact information storage. One potential key
application is DNA-based, molecular cryptography systems. We present some
procedures for DNA-based cryptography based on one-time-pads that are in
principle unbreakable. Practical applications of cryptographic systems based on
one-time-pads are limited in conventional electronic media by the size of the
one-time-pad; however, DNA provides a much more compact storage medium, and an
extremely small amount of DNA suffices even for huge one-time-pads. We detail
procedures for two DNA one-time-pad encryption schemes: (i) a substitution
method using libraries of distinct pads, each of which defines a specific,
randomly generated, pair-wise mapping; and (ii) an XOR scheme utilizing
molecular computation and indexed, random key strings. These methods can be
applied either for the encryption of natural DNA or for artificial DNA encoding
binary data. In the latter case, we also present a novel use of chip-based DNA
micro-array technology for 2D data input and output. Finally, we examine a class
of DNA steganography systems, which secretly tag the input DNA and then hide it
within collections of other DNA. We consider potential limitations of these
steganographic techniques, proving that in theory the message hidden with such
a method can be recovered by an adversary. We also discuss various modified DNA
steganography methods which appear to have improved security.



1 Introduction 

1.1 Biomolecular Computation 

Recombinant DNA techniques have been developed for a wide class of operations 
on DNA and RNA strands. There has recently arisen a new area of research 
known as DNA computing, which makes use of recombinant DNA techniques 
for doing computation, surveyed in [37]. Recombinant DNA operations were 
shown to be theoretically sufficient for universal computation [19]. Biomolecular 
computing (BMC) methods have been proposed to solve difficult combinatorial 
search problems such as the Hamiltonian path problem [1], using the vast paral- 
lelism available to do the combinatorial search among a large number of possible 
solutions represented by DNA strands. For example, [5] and [41] propose BMC
methods for breaking the Data Encryption Standard (DES). While these methods
for solving hard combinatorial search problems may succeed for fixed-size
problems, they are ultimately limited by their volume requirements, which may
grow exponentially with input size. However, BMC has many exciting further
applications beyond pure combinatorial search. For example, DNA and RNA are
appealing media for data storage due to the very large amounts of data that can
be stored in a compact volume. They vastly exceed the storage capacities of
conventional electronic, magnetic, and optical media. A gram of DNA contains
about $10^{21}$ DNA bases, or about $10^{8}$ terabytes. Hence, a few grams of
DNA may have the potential of storing all the data stored in the world.
Engineered DNA
might be useful as a database medium for storing at least two broad classes of 
data: (i) processed, biological sequences, and (ii) conventional data from binary, 
electronic sources. Baum [3] has discussed methods for fast associative searches 
within DNA databases using hybridization. Other BMC techniques [38] might 
perform more sophisticated database operations on DNA data such as database 
join operations and various massively parallel operations on the DNA data. 
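
The storage figure quoted for one gram of DNA can be checked with a
back-of-the-envelope calculation. The constants below are assumptions brought in
for the check and are not from the text: an average mass of about 330 g/mol per
nucleotide of single-stranded DNA, and two bits of information per base.

```python
import math

AVOGADRO = 6.022e23       # molecules per mole
NUCLEOTIDE_MASS = 330.0   # g/mol, assumed average mass of one nucleotide
BITS_PER_BASE = 2         # four bases carry log2(4) = 2 bits each

bases_per_gram = AVOGADRO / NUCLEOTIDE_MASS
bytes_per_gram = bases_per_gram * BITS_PER_BASE / 8
terabytes_per_gram = bytes_per_gram / 1e12

print(f"bases per gram:     about 10^{math.floor(math.log10(bases_per_gram))}")
print(f"terabytes per gram: about 10^{math.floor(math.log10(terabytes_per_gram))}")
```

Both orders of magnitude depend only weakly on the assumed nucleotide mass.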

1.2 Cryptography 

Data security and cryptography are critical aspects of conventional computing 
and may also be important to possible DNA database applications. Here we 
provide basic terminology used in cryptography [42]. The goal is to transmit 
a message between a sender and receiver such that an eavesdropper is unable 
to understand it. Plaintext refers to a sequence of characters drawn from a 
finite alphabet, such as that of a natural language. Encryption is the process of 
scrambling the plaintext using a known algorithm and a secret key. The output 
is a sequence of characters known as the ciphertext. Decryption is the reverse 
process, which transforms the encrypted message back to the original form using 
a key. The goal of encryption is to prevent decryption by an adversary who does 
not know the secret key. An unbreakable cryptosystem is one for which successful 
cryptanalysis is not possible. Such a system is the one-time-pad cipher. It gets its 
name from the fact that the sender and receiver each possess identical notepads 
filled with random data. Each piece of data is used once to encrypt a message 
by the sender and to decrypt it by the receiver, after which it is destroyed. 

1.3 Our Results 

This paper investigates a variety of biomolecular methods for encrypting and 
decrypting data that is stored as DNA. In Section 2, we present a class of DNA 
cryptography techniques that are in principle unbreakable. We propose the se- 
cret assembly of a library of one-time-pads in the form of DNA strands, followed 
by a number of methods to use such one-time-pads to encrypt large numbers 
of short message sequences. The use of such encryption with conventional elec- 
tronic media is limited by the large amount of one-time-pad data which must 
be created and transmitted securely. Since DNA can store a significant amount 
of information in a limited physical volume, the use of DNA could mitigate this 
concern. In Section 3, we present an interesting concrete example of a DNA
cryptosystem in which a two-dimensional image input is encrypted as a solution 
of DNA strands. We detail how these strands are then decrypted using fluores- 
cent DNA-on-a-chip technology. Section 4 discusses steganographic techniques 
in which the DNA encoding of the plaintext is hidden among other unrelated 
strands rather than actually being encrypted. We analyze a recently published 
genomic steganographic technique [45], where DNA plaintext strands were ap- 
pended with secret keys and then hidden among many other irrelevant strands. 
While the described system is appealing for its simplicity, our entropy based 
analysis allows extraction of the message without knowledge of the secret key. 
We then propose improvements that increase the security of DNA steganography. 



2 DNA Cryptosystems Using Random One-Time-Pads 

One-time-pad encryption uses a codebook of random data to convert plaintext to
ciphertext. Since the codebook serves as the key, if it were predictable (i.e.,
not random), then an adversary could guess the algorithm that generates the
codebook, allowing decryption of the message. No piece of data from the codebook
should ever be used more than once. If it were, then it would leak information
about the probability distribution of the plaintext, increasing the efficiency
of an attempt to guess the message. These two principles, true randomness and
single use of pads, dictate certain features of the DNA sequences and of
sequence libraries, which will be discussed further below. Cryptosystems of this
class, using a secret random one-time-pad, are the only cryptosystems known to
be absolutely unbreakable [42].

We will first assemble a large one-time-pad in the form of a DNA strand, 
which is randomly assembled from short oligonucleotide sequences, then isolated 
and cloned. These one-time-pads will be assumed to be constructed in secret, 
and we further assume that specific one-time-pads are shared in advance by 
both the sender and receiver of the secret message. This assumption requires 
initial communication of the one-time-pad between sender and receiver, which 
is facilitated by the compact nature of DNA. 

We propose two methods whereby a large number of short message sequences 
can be encrypted: (i) the use of substitution, where we encrypt each message 
sequence using an associatively matched piece from the DNA pad; or (ii) the use 
of bit-wise XOR computation using a biomolecular computing technique. The 
decryption is done by similar methods. 

It is imperative that the DNA ciphertext is not contaminated with any of the 
plaintext. In order for this to be effected, the languages used to represent each 
should be mutually exclusive. The simplest way to create mutually exclusive 
languages is to use disjoint plain and ciphertext alphabets. This would facilitate 
the physical separation of plaintext strands from the ciphertext using a set of 
complementary probes. If the ciphertext remains contaminated with residual 
plaintext strands, further purification steps can be utilized, such as the use of 
the DNA-SIEVE described in Section 4.4. 

2.1 DNA Cryptosystem Using Substitution

A substitution one-time-pad system uses a plaintext binary message and a table 
defining a random mapping to ciphertext. The input strand is of length n and 
is partitioned into plaintext words of fixed length. The table maps all possible 
plaintext strings of a fixed length to corresponding ciphertext strings, such that 
there is a unique reverse mapping. 
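
In software terms, the table is a random bijection on fixed-length words. The
sketch below is only an electronic analogy of the DNA pad, with hypothetical
function names; it draws the permutation with Python's secrets.SystemRandom and
shows that the reverse mapping recovers the message. It does not model the
single-use discipline or the molecular steps.

```python
import secrets
from itertools import product


def make_pad(word_len: int) -> dict[str, str]:
    """Random one-to-one table from every binary word of word_len to another."""
    rng = secrets.SystemRandom()  # OS-provided randomness
    words = ["".join(bits) for bits in product("01", repeat=word_len)]
    shuffled = words[:]
    rng.shuffle(shuffled)
    return dict(zip(words, shuffled))


def substitute(message: str, table: dict[str, str], word_len: int) -> str:
    """Replace each fixed-length word of the message using the table."""
    if len(message) % word_len:
        raise ValueError("message length must be a multiple of the word length")
    return "".join(table[message[i:i + word_len]]
                   for i in range(0, len(message), word_len))


pad = make_pad(4)
inverse = {cipher: plain for plain, cipher in pad.items()}  # unique reverse map
plaintext = "101100001111"
ciphertext = substitute(plaintext, pad, 4)
recovered = substitute(ciphertext, inverse, 4)
print(recovered == plaintext)
```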

Encryption occurs by substituting each plaintext DNA word with a corresponding
DNA cipher word. The mapping is implemented using a long DNA pad consisting of
many segments, each of which specifies a single plaintext word to cipher word
mapping. The plaintext word acts as a hybridization site for the binding of a
primer, which is then elongated. This results in the formation of a
plaintext-ciphertext word-pair. The word-pairs must then be cleaved and the
plaintext portion removed. A potential application is detailed in Section 3.

An ideal one-time-pad library would contain a huge number of pads and 
each would provide a perfectly unique, random mapping from plaintext words to 
cipher words. Our construction procedure approaches these goals. The structure 
of an example pad is given in Figure 1. 



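[Figure 1: a one-time pad drawn as a chain of repeating units
... STOP C_{i-1} P_{i-1} STOP C_i P_i STOP C_{i+1} P_{i+1} ..., with a primer
complementary to P_i annealed to the pad and the 3' end marked.]

Fig. 1. One-time-pad Codebook DNA Sequences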



The repeating unit is made up of: (i) one sequence word, $C_i$, from the set of
cipher or codebook-matching words; (ii) one sequence word, $P_i$, from the set
of plaintext words; and (iii) a polymerase "stopper" sequence. We note that each
$P_i$ includes a unique prefix subsequence, which prevents frequency analysis
attacks by mapping multiple instances of the same message plaintext to different
ciphertext words. Further, this prefix could optionally be used to encode the
position of the word in the message.

Each sequence pair i uniquely associates a plaintext word with a cipher word. An
oligo with sequence $\bar{P}_i$, the Watson-Crick complement of plaintext word
$P_i$, can be used as a polymerase primer and be extended by specific attachment
of the complement of cipher word $C_i$. The stopper sequence prohibits extension
of the growing DNA strand beyond the boundary of the paired cipher word. A
library of unique codebook strands is constructed using this theme. Each
individual strand from this codebook library specifies a particular, unique set
of word pairings.

The one-time-pad consists of a DNA strand of length n containing
$d = n/(L_1 + L_2 + L_3)$ copies of the repeating pattern: a cipher word of
length $L_2$, a plaintext word of length $L_1$, and a stopper sequence of length
$L_3$. We note that word length grows logarithmically in the total pad length;
specifically $L_1 = c_1 \log_2 n$, $L_2 = c_2 \log_2 n$, and $L_3 = c_3$, for
fixed integer constants $c_1, c_2, c_3 > 1$. Each repeat unit specifies a single
mapping pair, and no codebook word or plaintext word will be used more than once
on any pad. Therefore, given a cipher word $C_i$ we are assured that it maps to
only a single plaintext word $P_i$ and vice versa. The stopper sequence acts as
"punctuation" between repeat units so that DNA polymerase will not be able to
continue copying down the template (pad) strand. Stopper sequences consist of a
run of identical nucleotides which act to halt strand copying by DNA polymerase
given a lack of complementary nucleotide triphosphate in the test tube. For
example, the sequence TTTT will act as a stop point if the polymerization
mixture lacks its base-pairing complement, A. Stopper sequences of this variety
have been prototyped previously [18]. Given this structure, we can anneal
primers and extend with polymerase in order to generate a set of
oligonucleotides corresponding to the plaintext/cipher lexical pairings.

The experimental feasibility depends upon the following factors: (i) the size of 
the lexicon, which is the number of plaintext-ciphertext word-pairs, (ii) the size 
of each word, (iii) the number of DNA one-time-pads that can be constructed in 
a synthesis cycle, and (iv) the length of each message that is to be encrypted. If 
the lexicon used consisted of words of the English language, its size would be in 
the range of 10,000 to 25,000 word-pairs. If for experimental reasons, a smaller 
lexicon is required, then the words used could represent a more basic set such as 
ASCII characters, resulting in a lexicon size of 128. The implicit tradeoff is that 
this would increase message length. We estimate that in a single cloning proce- 
dure, we can produce 10® to 10® different one-time-pad DNA sequences. Choice 
of word encodings must guarantee an acceptable Hamming distance between 
sequences such that the fidelity of annealing is maximized. When generating 
sequences that will represent words, the space of all possibilities is much larger 
than the set that is actually needed for the implementation of the words in the 
lexicon. We also note that if the lexicon is to be split among multiple DNA 
one-time-pads, then care should be taken during pad construction to prevent a 
single word from being mapped to multiple targets. 

If long-PCR with high fidelity enzymes introduces errors and the data in 
question is from an electronic source, we can pre-process it using suitable error- 
correction coding. If instead we are dealing with a wet database, the DNA one- 
time-pad’s size can be restricted. This is done by splitting the single long one- 
time-pad into multiple shorter one-time-pads. In this case each cipher word would 
be modified to include a subsequence prefix that would denote which shorter 
one-time-pad should be used for its decryption. This increases the difficulty of 
cloning the entire set of pads. 

The entire construction process can be repeated to prepare greater numbers
of unique pads. Construction of the libraries of codebook pads can be approached
using segmental assembly procedures used successfully in previous gene library
construction projects [25,24] and DNA word encoding methods used in DNA
computation [10,11,40,13,14,32]. One methodology is chemical synthesis of a
diverse segment set followed by random assembly of pads in solution. An issue
to consider with this methodology is the difficulty of achieving full coverage
while avoiding possible conflicts due to repetition of plaintext or cipher words.
We can set the constants $c_1$ and $c_2$ large enough so that the probability of
getting repeated words on a pad of length n is acceptably small.
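
A rough estimate shows how quickly this probability falls as the constants
grow. The sketch assumes that words are drawn independently and uniformly at
random over the four bases, which idealizes random assembly, and uses a simple
union bound over pairs of repeat units. The word lengths follow the logarithmic
formulas given earlier for the pad structure; the function name is hypothetical.

```python
import math


def repeat_probability_bound(n: int, c1: int, c2: int, c3: int) -> float:
    """Union bound on the chance that some plaintext or cipher word repeats."""
    L1 = c1 * math.log2(n)   # plaintext word length
    L2 = c2 * math.log2(n)   # cipher word length
    L3 = c3                  # stopper length
    d = n / (L1 + L2 + L3)   # number of repeat units on the pad
    pairs = d * (d - 1) / 2  # pairs of units that could collide
    # A given pair of uniformly random words of length L collides with
    # probability 4**(-L); sum over all pairs and both word types.
    bound = pairs / 4 ** L1 + pairs / 4 ** L2
    return min(1.0, bound)


n = 2 ** 20
for c in (2, 3):
    print(c, repeat_probability_bound(n, c, c, 4))
```

Since 4 raised to the power c log2 n equals n raised to the power 2c, the bound
shrinks polynomially in n for any fixed c above 1.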

Another methodology would be to use a commercially available DNA chip 
[12,35,8,6,44]. See [31] for previous use of DNA chips for I/O. The DNA chip has 
an array of immobilized DNA strands, so that multiple copies of a single sequence 
are grouped together in a microscopic pixel. The microscopic arrays of the DNA 
chip are optically addressable, and there is known technology for growing distinct 
DNA sequences at each optically addressable site of the array. Light-directed 
synthesis allows the chemistry of DNA synthesis to be conducted in parallel at 
thousands of locations, i.e., it is a combinatorial synthesis. Therefore, the number 
of sequences prepared far exceeds the number of chemical reactions required. For 
preparation of oligonucleotides of length $L$, the $4^L$ sequences are synthesized in
$4L$ chemical reactions. For example, the ~65,000 sequences of length 8 require
32 synthesis cycles, and the $1.67 \times 10^7$ sequences of length 12 require only 48
cycles. Each pixel location on the chip comprises 10 microns square, so the
complete array of 12-mer sequences could be contained within a ~4 cm square.
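
The arithmetic behind these figures can be checked with a short Python sketch. The function name `chip_stats`, the square-grid layout, and the default pixel pitch of 10 microns are illustrative assumptions based on the numbers quoted above, not a description of any particular chip.

```python
import math

def chip_stats(length, pixel_microns=10.0):
    """Return (sequences, cycles, side_cm) for all DNA words of a given length.

    Assumes one pixel per sequence, laid out on a square grid.
    """
    sequences = 4 ** length                 # every word over {A, C, G, T}
    cycles = 4 * length                     # one cycle per base per position
    pixels_per_side = math.isqrt(sequences - 1) + 1   # ceiling of the square root
    side_cm = pixels_per_side * pixel_microns * 1e-4  # 1 micron = 1e-4 cm
    return sequences, cycles, side_cm

for k in (8, 12):
    seqs, cycles, side = chip_stats(k)
    print(f"length {k}: {seqs} sequences, {cycles} cycles, {side:.2f} cm per side")
```

For length 12 this gives 16,777,216 sequences in 48 cycles on a grid of 4096 by 4096 pixels, which is consistent with the roughly 4 cm square mentioned in the text.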

2.2 DNA XOR One-Time-Pad Cryptosystem 

The Vernam cipher [21] uses a sequence, S, of R uniformly distributed random 
bits known as a one-time-pad. A copy of S is stored at both the sender’s and 
receiver’s locations. L is the number of bits of S that have not yet been used 
for encrypting a message. Initially L = R. XOR operates on two binary inputs, 
yielding 0 if they are the same and 1 otherwise. When a plaintext binary message 
M which is $n < L$ bits long needs to be sent, each bit $M_i$ is XOR'ed with the bit
$K_i = S_{R-L+i}$ to produce the encrypted bit $C_i = M_i \oplus K_i$ for $i = 1, \ldots, n$. The
$n$ bits of S that have been consumed are then destroyed at the source and the
encrypted sequence $C = (C_1, C_2, \ldots, C_n)$ is dispatched to the destination. At
the destination the identical process is repeated: the sequence C is used
in the place of M, performing bitwise XOR with bits from S, destroying the bits
of S after they are consumed. The self-inverse property of binary XOR results in
the initial message being reproduced, since $C_i = M_i \oplus K_i$ and
$C_i \oplus K_i = M_i \oplus K_i \oplus K_i = M_i$.
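
The bookkeeping of the pad can be illustrated in software. The following is a minimal Python sketch of a Vernam cipher in which used pad bits are discarded; the class name `OneTimePad` and its methods are hypothetical names chosen for this example, and bits are modelled as Python integers rather than DNA.

```python
import secrets

class OneTimePad:
    """Shared pad S of R random bits; used bits are discarded after use."""

    def __init__(self, bits):
        self.bits = list(bits)

    def remaining(self):
        # Corresponds to L, the number of pad bits not yet consumed.
        return len(self.bits)

    def apply(self, message):
        """XOR message bits with the next unused pad bits, then destroy them."""
        n = len(message)
        if n > self.remaining():
            raise ValueError("message longer than unused pad")
        key, self.bits = self.bits[:n], self.bits[n:]
        return [m ^ k for m, k in zip(message, key)]

# Sender and receiver hold identical copies of the pad.
pad_bits = [secrets.randbits(1) for _ in range(32)]
sender, receiver = OneTimePad(pad_bits), OneTimePad(pad_bits)

message = [1, 0, 1, 1, 0, 0, 1, 0]
cipher = sender.apply(message)
assert receiver.apply(cipher) == message   # XOR is its own inverse
```

Both parties consume the same pad positions in the same order, which is what keeps the two copies synchronized.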

To implement this algorithm with DNA, we need methods to (i) encode a 
plaintext message, (ii) create a DNA one-time-pad and (iii) effect bitwise XOR 
in DNA. Several methods exist to effect binary addition and XOR with DNA. In 
1996, [16] prototyped single bit addition. Subsequent proposals [34,17] allowed 
for chaining outputs with inputs, and parallel operations. [22] experimentally
demonstrated a logically reversible conditional XOR that required O(n) recombinant
DNA operations to act on n-bit data. [26] described a specific DNA tiling
implementation of XOR and addition, based on previous work on self-assembly
of DNA tilings [47,46,48,49,50,36]. An example of cumulative XOR using
self-assembled DNA tilings has recently been published [30].

DNA tiles are multi-strand complexes containing two or more double helical 
domains such that individual oligonucleotide chains might base-pair in one helix 
then cross-over and base-pair in another helix. Complexes involving crossovers 
(strand exchange points) create tiles which are multivalent and can have quite 
high thermal stability. Many helix arrangements and strand topologies are pos- 
sible and several have been experimentally tested [28,27]. Tiles with specific 
uncomplemented sticky ends at the corners were constructed, with the purpose 
of effecting self-assembly. 

A binary input string can be encoded using a single tile for each bit. The tiles 
are designed such that they assemble linearly to represent the binary string. The 
use of special connector tiles allows two such linear tile assemblies, representing
two binary input strings respectively, to come together and create a closed
framework within which specially designed output tiles can fit. This process allows for
unmediated parallel binary addition or XOR. As a result of the special design
of these tiles, at the end of the process, there exists a single strand that runs 
through the entire assembly which will contain the two inputs and the output 
[27,26]. By using this property, we are able to effect the Vernam cipher in DNA. 



Fig. 2. TAO triple-helix tile



In particular, we use TAO triple-helix tiles (see Figure 2). The tile is formed 
by the complementary hybridization of four oligonucleotide strands (shown as 
different line types with arrowheads on their 3' ends). The three double-helical 
domains are shown with horizontal helix axes where the upper and lower helices 
end with bare strand ends and the central helix is capped at both ends with 
hairpins. Paired vertical lines represent crossover points where two strands ex- 
change between two helices. Uncomplemented sticky ends can be appended to 
the four corners of the tile and encode neighbour rules for assembly of larger 
structures including computational complexes. For more details see [27,30].

We outline below how the bit-wise XOR operation may be done (see Figure 3).
For each bit $M_i$ of the message, we construct a sequence $a_i$ that will represent
the bit in a longer input sequence. By using suitable linking sequences, we
can assemble the message M's n bits into the sequence $a_1 a_2 \ldots a_n$, which serves
as the scaffold strand for one binary input to the XOR. The further portion
of the scaffold strand $a'_1 a'_2 \ldots a'_n$ is created based on random inputs and serves
as the one-time-pad. It is assumed that many scaffolds of the form $a'_1 a'_2 \ldots a'_n$
have been initially created, cloned using PCR [2,39] or an appropriate technique,
and then separated and stored at both the source and destination points in
advance. When encryption needs to occur at the source, the particular scaffold
used is noted and communicated using a prefix index tag that both sender and
destination know corresponds to a particular scaffold.

By introducing the scaffold for the message, the scaffold for the one-time- 
pad, connector tiles and the various sequences needed to complete the tiles, the 
input portion of the structure shown in Figure 3 forms. We call this the input 
assembly. This process of creating input scaffolds and assembling tiles on the 
scaffold has been carried out successfully [26]. Each pair of tiles (corresponding
to a pair of binary inputs) in the input assembly creates a slot for the binding of a 
single output tile. When output tiles are introduced, they bind into appropriate 
binding slots by the matching of sticky ends. 

Finally, adding ligase enzyme results in a continuous reporter strand R that
runs through the entire assembly. If $b_i = a_i \oplus a'_i$ for $i = 1, \ldots, n$, then the
reporter strand is $R = a_1 a_2 \ldots a_n \, a'_1 a'_2 \ldots a'_n \, b_1 b_2 \ldots b_n$. The reporter
strand is shown as a dotted line in Figure 3. This strand may be extracted by
first melting apart the hydrogen bonding between strands and then purifying by
polyacrylamide gel electrophoresis. It contains the input message, the encryption
key, and the ciphertext all linearly concatenated. The ciphertext can be excised
using a restriction endonuclease if a cleavage site is encoded between the
adjoining input tile and the $b_1$ tile.
Alternatively the reporter strand could incorporate a nick at that point by using
an unphosphorylated oligo between those tiles. The ciphertext could then be gel
purified since its length would be half that of the remaining sequence. This may
then be stored in a compact form and sent to a destination.

Since XOR is its own inverse, the decryption of a Vernam cipher is effected
by applying the identical process as encryption with the same key. Specifically,
the strand $b_1 b_2 \ldots b_n$ is used as one input scaffold, and the other is chosen
from the stored $a'_1 a'_2 \ldots a'_n$ according to the index indicating which sequence
was used as the encryption key. The sequences for tile reconstitution, the
connector tiles, and ligase are added. After self-assembly, the reporter strand is
excised, purified, cut at the marker and the plain text is extracted.
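
As an abstract illustration of this encrypt/decrypt symmetry, the sketch below models the reporter strand as a list of bits: the two input scaffolds followed by their bitwise XOR. The function names `assemble_reporter` and `excise_output` are hypothetical, and cutting "at the marker" is modelled simply as keeping the final third of the list. No chemistry is simulated.

```python
def assemble_reporter(x, y):
    """Model the ligated reporter strand: both inputs followed by their XOR."""
    if len(x) != len(y):
        raise ValueError("input scaffolds must have equal length")
    out = [xi ^ yi for xi, yi in zip(x, y)]
    return x + y + out

def excise_output(reporter):
    """Model cutting at the marker: keep the final third of the reporter."""
    n = len(reporter) // 3
    return reporter[2 * n:]

message = [1, 0, 0, 1]
pad = [0, 1, 0, 1]

# Encryption: message and pad scaffolds go in, ciphertext is excised.
ciphertext = excise_output(assemble_reporter(message, pad))

# Decryption: the same process, with the ciphertext in place of the message.
recovered = excise_output(assemble_reporter(ciphertext, pad))
assert recovered == message
```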

We need to guard against loss of synchronization between the message and 
the key, which would occur when a bit is spuriously introduced or deleted from 
either sequence. Some fault tolerance is provided by the use of several nucleotides
to represent each bit in the tiles' construction. This reduces the probability
of such errors.

Fig. 3. XOR computation by the use of DNA tiles. Input strings $a_1, a_2, a_3, \ldots, a_n$
and $a'_1, a'_2, a'_3, \ldots, a'_n$; output string $b_1, b_2, \ldots, b_n$



3 Encrypting Images with DNA Chips and DNA 
One-Time-Pads 

3.1 Overview of Method 

In this section we outline a system capable of encryption and decryption of input 
and output data in the form of 2D images recorded using microscopic arrays 
of DNA on a chip. The system we describe here consists of: a data set to be 
encrypted, a chip bearing immobilized DNA strands, and a library of one-time- 
pads encoded on long DNA strands as described in Section 2.1. The data set for 
encryption in this specific example is a 2-dimensional image, but variations on 
the method may be useful for encoding and encrypting other forms of data or 
types of information. The DNA chip contains an addressable array of nucleotide 
sequences immobilized such that multiple copies of a single sequence are grouped 
together in a microscopic pixel. Such DNA chips are currently commercially 
available and chemical methods for construction of custom variants are well 
developed. Further chip details will be given below. 




Fig. 4. DNA Chip Input/Output: Panel A: Message, Panel B: Encrypted Message,
Panel C: Decrypted Message

Figure 4 gives a coarse grained outline of the I/O method. Fluorescently labeled,
word-pair DNA strands are prepared from a substitution pad codebook
as described in Section 2.1. These are annealed to their sequence complements 
at unique sites (pixels) on the DNA chip. The message information (Panel A) is 
transferred to a photo mask with transparent (white) and opaque (black) regions. 
Following a light-flash of the mask-protected chip, the annealed oligonucleotides 
beneath the transparent regions are cleaved at a photo-labile position. Their 5' 
sections dissociate from the annealed 3' section and are collected in solution. 
This test tube of fluorescently labeled strands is the encrypted message. An- 
nealed oligos beneath the opaque regions are unaffected by the light-flash and 
can be subsequently washed off the chip and discarded. If the encrypted message 
oligos are reannealed onto a different DNA chip, they would anneal to different 
locations and the message information would be unreadable (Panel B). Note that 
if one used a chip identical to the encoding chip, and if the sequence lexicons 
for 5' segment (cipher word) and 3' segment (plaintext word) are disjoint, no 
binding would occur and the chip in Panel B would be completely black. De- 
crypting is accomplished by using the fluorescently labeled oligos as primers in 
asymmetric PCR with the same one-time codebook which was used to prepare 
the initial word-pair oligos. When the word-pair PCR product is bound to the 
same DNA chip, the decrypted message is revealed (Panel C).

Fig. 5. Components and Organization of the DNA Chip



The annealed DNA in Figure 5 corresponds to the word-pair strands pre- 
pared from a random substitution pad as described in Section 2.1 above. Immo- 
bile DNA strands are located on the glass substrate of the chip in a sequence 
addressable grid according to currently used techniques. Ciphertext-plaintext 
word-pair strands anneal to the immobile ones. The annealed strand contains a 
fluorescent label on its 5' end (denoted with an asterisk in the figure). This is 
followed by the complement of a plaintext word (uncomplemented section) and
the complement of a cipher word (complemented section). Located between the
two words is a photo-cleavable base analog (white box in the figure) capable of
cleaving the backbone of the oligonucleotide.

Figure 6 gives step by step procedures for encryption and decryption. For 
encryption, we start with a DNA chip displaying the sequences drawn from the 
cipher lexicon. In step one, the fluorescently labeled word-pair strands prepared
from a one-time-pad are annealed to the chip at the pixel bearing the complement
to their 3' end. In the next step, the mask (heavy black bar) protects some pixels
from a light-flash. At unprotected regions, the DNA backbone is cleaved at a
site between the plaintext and cipher words. In the final step, the 5' segments,
still labeled with fluorophore at their 5' ends, are collected and transmitted as
the encrypted message. 



Fig. 6. Step by step procedures for encryption and decryption. Encryption scheme:
DNA chip; anneal labeled DNA; mask and flash; encoded message DNA. Decryption
scheme: encoded message DNA and codebook DNA; extend with DNA polymerase; isolate
word pair strands; anneal onto DNA chip; decoded message for fluorescent read-out.



A message can be decrypted only by using the one-time-pad and DNA chip 
identical to those used in the encryption process. First, the word-pair strands 
must be reconstructed by appending the proper cipher word onto each plain- 
text word. This is accomplished by primer extension or asymmetric PCR using 
transmitted strands as primer and one-time-pad as template. The strands bind 
to their specific locations on the pad and are appended with their proper cipher
partner. Note that in the decrypting process the fluorescent label is still
required, but the photo-labile base is unnecessary and not present. The final step
of decryption involves binding the reformed word-pair strands to the DNA chip 
and reading the message by fluorescent microscopy. 
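
The information flow of this scheme (not its chemistry) can be sketched in Python. In this simplified model, each pixel of the chip holds one immobilized word, each word-pair strand couples that chip word to a distinct released word, the photo mask selects which strands are cleaved, and the codebook is needed to map released words back to pixels. All names (`make_codebook`, `encrypt`, `decrypt`, the `P…`/`C…` word labels) are hypothetical and chosen only for illustration.

```python
import random

def make_codebook(num_pixels, rng):
    """Pair each pixel's chip word with a distinct, randomly chosen released word."""
    chip_words = [f"P{i}" for i in range(num_pixels)]
    released = [f"C{i}" for i in range(num_pixels)]
    rng.shuffle(released)
    return dict(zip(released, chip_words))    # released word -> chip word

def encrypt(image, codebook, chip_layout):
    """Collect the released words of strands under transparent (1) pixels."""
    by_chip_word = {chip: rel for rel, chip in codebook.items()}
    # A set models the unordered mixture of strands collected in solution.
    return {by_chip_word[chip_layout[i]] for i, bit in enumerate(image) if bit}

def decrypt(tube, codebook, chip_layout):
    """Rebuild word pairs from the codebook and read them out on the chip."""
    pixel_of = {word: i for i, word in enumerate(chip_layout)}
    image = [0] * len(chip_layout)
    for released_word in tube:
        image[pixel_of[codebook[released_word]]] = 1
    return image

rng = random.Random(0)
layout = [f"P{i}" for i in range(16)]          # chip word at each pixel
book = make_codebook(16, rng)
picture = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
assert decrypt(encrypt(picture, book, layout), book, layout) == picture
```

Without the codebook, the collected words carry no positional information, which mirrors the role of the one-time-pad in the laboratory procedure.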

3.2 Additional Technical Considerations 

Some details concerning the configuration of the DNA chip should be mentioned. 
In the current incarnation of the method, reverse synthesis of oligos directly on 
the chip or “spot attachment” would be required. Chemical reagents for reverse 
synthesis are commercially available although not as widely used as those for 
3’-to-5’ methods. Spot attachment of oligos onto the chip results in decreased 
pixel density and increased work. However, recent chip improvements, including 
etching with hydrophobic gridlines, may alleviate this drawback. 

One potential photo-cleavable base analog is 7-nitroindole nucleoside. It has
previously been shown to produce a chemical lesion in the affected DNA strand
which causes backbone cleavage. Use of 7-nitroindole nucleoside for strand
cleavage has been shown to produce usable 5' and 3' ends [23]. Production of 'clean'
3' ends is critical for decrypting the message, since the cipher strands must
hybridize to the one-time-pad and act as primers for the polymerase mediated
strand elongation (PCR). Primers and templates containing the 7-nitroindole
nucleoside have been shown to function properly in PCR and other enzymatic
reactions.



4 DNA Steganography Analysis 

Steganography using DNA is appealing due to its simplicity. One method proposed
involves taking “plaintext” input DNA strands, tagging each with “secret
key” strands, and then hiding them among random “distracter” strands. The 
plaintext is retrieved by hybridization with the complement of the secret key 
strands. It has been postulated that in the absence of knowledge of the secret 
key, it would be necessary to examine all the strands including the distracters to 
retrieve the plaintext. Based on the likely difference in entropy of the distracters 
and the plaintext, we argue that the message can be retrieved without the key. 

4.1 Relevant Data Compression Result 

The compression ratio is the quotient of the length of the compressed data 
divided by the length of the source data. For example, many images may be 
losslessly compressed to a 1/4 of their original size; English text and computer 
programs have compression ratios of about 1/3; most DNA has a compression 
ratio between 1/2 and 1/1.2 [15,29]. Protein coding sequences make efficient use 
of amino acid coding and have larger compression ratios [33,20]. The Shannon 
information theoretic entropy rate is denoted by $H_S \le 1$. It is defined to be
the rate that the entropy increases per symbol, for a sequence of length $n$ with
$n \to \infty$ [9]. It provides a measure of the asymptotic rate at which a source
can be compressed without loss of information. Random sequences cannot be
compressed and therefore have an entropy rate of 1.

Lossless data compression with an algorithm such as Lempel-Ziv [51] is
theoretically asymptotically optimal. For sequences whose length $n$ is large, the
compression ratio approaches the entropy rate of the source. In particular, it
is of the form $(1 + \epsilon(n)) H_S$, where $\epsilon(n) \to 0$ as $n \to \infty$. Algorithms such
as Lempel-Ziv build an indexed dictionary of all subsequences parsed that cannot
be constructed as a catenation of current dictionary entries. Compression is
performed by sequentially parsing the input text, finding maximal length
subsequences already present in the dictionary, and outputting their index number
instead. When a subsequence is not found in the dictionary, it is added to it
(including the case of single symbols). Algorithms can achieve better compression
by making assumptions about the distribution of the data [4]. It is possible to
use a dictionary of bounded size, consisting of the most frequent subsequences.
Experimentation on a wide variety of text sources shows that this method can
be used to achieve compression within a small percentage of the ideal [43]. In
particular, the compression ratio is of the form $(1 + \epsilon) H_S$, for a small constant
$\epsilon > 0$, typically at most 1/10 if the source length is large.
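
To make the notion of a "parsed word" concrete, here is a minimal Python sketch of LZ78-style parsing, one common member of the Lempel-Ziv family. It is an assumption that this variant is representative of the algorithm the analysis has in mind; the function names and the test strings are illustrative only.

```python
import random

def lz78_parse(text):
    """Split text into LZ78 phrases: each is a known phrase plus one new symbol."""
    dictionary = {"": 0}
    phrases, current = [], ""
    for symbol in text:
        candidate = current + symbol
        if candidate in dictionary:
            current = candidate            # keep extending the current match
        else:
            dictionary[candidate] = len(dictionary)
            phrases.append(candidate)
            current = ""
    if current:
        phrases.append(current)            # trailing phrase already in dictionary
    return phrases

def mean_phrase_length(text):
    """Average length of a parsed word for the given text."""
    return len(text) / len(lz78_parse(text))

rng = random.Random(1)
random_dna = "".join(rng.choice("ACGT") for _ in range(20000))
repetitive_dna = "ACGTTGCA" * 2500
print(mean_phrase_length(random_dna), mean_phrase_length(repetitive_dna))
```

A low-entropy source parses into fewer, longer words than a uniformly random source of the same length, and this difference is what the following lemmas quantify.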

Lemma 1. The expected length of a parsed word is between $\frac{L}{1+\epsilon}$ and $L$, where
$L = \frac{\log_b n}{H_S}$.

Proof. Assume the source data has an alphabet of size $b$. An alphabet of the
same size can be used for the compressed data. The dictionary can be limited to
a constant size. We can choose an algorithm that achieves a compression ratio
within $1 + \epsilon$ of the asymptotic optimal, for a small $\epsilon > 0$. Therefore, for large $n$,
we can expect the compression ratio to approach $(1 + \epsilon) H_S$.

Each parsed word is represented by an index into the dictionary, and so its size
would be $\log_b n$ if the source had no redundancy. By the choice of compression
algorithm, the length of the compressed data is between $H_S$ and $H_S(1 + \epsilon)$ times
the length of the original data. From these two facts, it follows that the expected
length of a code word will be between $\frac{\log_b n}{(1+\epsilon) H_S}$ and $\frac{\log_b n}{H_S}$.



Lemma 2. A parsed word has length $\le \frac{L}{p}$ with probability $\ge 1 - p$.

Proof. The probability of a parsed word having length $> \frac{L}{p}$ is $\le p$, for all
$p \in (0, 1)$, by the Markov inequality. The lemma follows from this.



Lemma 3. A parsed word has length $\ge c'L$ with probability $\ge 1 - p$, if
$p > 1 - \frac{1}{c(1+\epsilon)}$, where $c' = c - \frac{1}{p}\left(c - \frac{1}{1+\epsilon}\right) > 0$.

Proof. The maximum length of a parsed word has an upper bound in practice.
We assume that this is $cL$ for a small constant $c > 1$. We use $\Delta$ to denote
the difference between the maximum possible and the actual length of a parsed
word, and $\bar{\Delta}$ to denote the difference's expected value. Applying Lemma 1,
$0 \le \bar{\Delta} \le cL - \frac{L}{1+\epsilon} = \left(c - \frac{1}{1+\epsilon}\right)L$. The probability that
$\Delta > \frac{\bar{\Delta}}{p}$ is $\le p$, by the Markov inequality. Since
$\frac{\bar{\Delta}}{p} \le \frac{1}{p}\left(c - \frac{1}{1+\epsilon}\right)L$, with probability $\le p$ a parsed word
has length $< cL - \frac{1}{p}\left(c - \frac{1}{1+\epsilon}\right)L = c'L$, where
$c' = c - \frac{1}{p}\left(c - \frac{1}{1+\epsilon}\right)$. We choose $p > 1 - \frac{1}{c(1+\epsilon)}$
so that $0 < c' < c$, since parsed words must have positive length that does not
exceed the maximum postulated.



Lemma 4. A parsed word has length between $c'L$ and $\frac{L}{p}$ with probability
$\ge (1 - p)^2$, if $p > 1 - \frac{1}{c(1+\epsilon)}$ and $c' > H_S$.

Proof. This follows from Lemmas 2 and 3. 



4.2 Analysis Assumptions 

We assume all the following. The alphabet in question is that of DNA and 
therefore has size 4. The probability distribution of the “plaintext” DNA source 
S is known - for example, that it is generated by a stationary ergodic process. The 
“distracter” DNA strands have a random uniform distribution over the 4 DNA 
bases. Both “plaintext” and “distracter” DNA strands have the same length since 
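they may otherwise be distinguished by length. A Lempel-Ziv algorithm variant
that meets the criteria of Lemma 4 is known. $p$ is fixed just above $1 - \frac{1}{c(1+\epsilon)}$.
We write $f(n) \approx g(n)$ if $\frac{f(n)}{g(n)} \to 1$, and use $(1 - \frac{1}{n})^n \approx \frac{1}{e}$, for large $n$.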



4.3 Constructing a Dictionary 

Let $L = \frac{\log_4 n}{H_S}$. D is the set of d most frequently occurring words of the
source, where d is the size of the dictionary. D' is the subset of D that consists
of words that meet the following two criteria. The first is that the word's length
must be between $c'L$ and $\frac{L}{p}$. The second is that the word's frequency in the
source S must be $> \frac{1}{n'}$, where $n' = (1 - p)^2 \frac{n}{L}$.

Lemma 5. The probability that a word w in D' is a parsed word of the “plaintext”
DNA sequence is $> 1 - \frac{1}{e}$.

Proof. Let X be a “plaintext” DNA sequence of length n. Consider D'', the
subset of D containing words of length between $c'L$ and $\frac{L}{p}$. D'' contains at least
$(1 - p)^2$ of the parsed words of X by Lemma 4. D' is the subset of D'' which
consists of only words that have frequency $> \frac{1}{n'}$. Consider a word v parsed from
X. The probability that a word w from D' is not v is $< 1 - \frac{1}{n'}$ by construction.
X has an expected number $\frac{n}{L}$ of parsed words. By Lemma 1, there are an expected
number $(1 - p)^2 \frac{n}{L}$ of words with length in the range between $c'L$ and $\frac{L}{p}$. The
probability that w is none of these words is therefore $< (1 - \frac{1}{n'})^{n'} \approx \frac{1}{e}$.
Thus, a word w in D' is some parsed word of X with probability $> 1 - \frac{1}{e}$.

4.4 DNA-SIEVE: Plaintext Extraction Without the Key

Strand separation operates on a test tube of DNA, using recombinant DNA
technology to split the contents into two tubes, T and T', with separation error
rates $\sigma^-, \sigma^+$: $0 < \sigma^-, \sigma^+ < 1$. The goal is to transfer all the strands that contain
the subsequence w into the tube T', leaving all the rest in tube T. A fraction
$\le \sigma^-$ of the strands without subsequence w enter T'. A fraction $\ge 1 - \sigma^+$ of
the strands containing w enter T', so that at most $\sigma^+$ of them are left in T. We
assume that $\rho = \frac{\sigma^-}{(1 - \sigma^+)(1 - \frac{1}{e})}$ satisfies $0 < \rho < 1$.
Modest expectations for separation technology yield $0 < \sigma^- < 0.2$ and
$0 < \sigma^+ < 0.05$. Using $\sigma^- = \sigma^+ = 0.2$ suffices to obtain $\rho$ in the desired range.

DNA-SIEVE is to be used to extract the “plaintext” DNA strands from
the mix in which there are many “distracters”. It begins with a tube T. The
separate operation is iteratively applied. In each round, a previously unused
word w from the set D' is chosen. All strands that contain it are retained by
using hybridization with the complement of w. We use r(T) to denote the ratio
of the distracters to the plaintext, and F(T) to denote the tube into which the
strands with subsequence w were removed.
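
The iteration can be sketched in Python by modelling strands as strings and the separation step as a noisy substring filter. This is a simplified illustration under stated assumptions: the function names `separate` and `dna_sieve` and the parameters `false_pos` and `false_neg` (playing the roles of the two separation error rates) are hypothetical, and hybridization is reduced to an exact substring test.

```python
import random

def separate(tube, word, false_pos, false_neg, rng):
    """One noisy separation: return the strands pulled out by the probe for `word`."""
    extracted = []
    for strand in tube:
        if word in strand:
            if rng.random() >= false_neg:   # strand with the word is captured
                extracted.append(strand)
        elif rng.random() < false_pos:      # strand without the word slips through
            extracted.append(strand)
    return extracted

def dna_sieve(tube, words, false_pos, false_neg, rng):
    """Apply `separate` once for each previously unused dictionary word."""
    for word in words:
        tube = separate(tube, word, false_pos, false_neg, rng)
    return tube

rng = random.Random(0)
tube = ["ACGTAC", "TTTTTT", "GGACGT"]
print(dna_sieve(tube, ["ACG"], false_pos=0.0, false_neg=0.0, rng=rng))
```

With both error rates set to zero the filter is exact; non-zero rates let some distracters through and lose some plaintext strands, which is the situation analyzed in the next subsection.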

4.5 DNA-SIEVE Analysis 

The success of DNA-SIEVE rests on the fact that a word in D' is likely to occur
in plaintext X with probability $> 1 - \frac{1}{e}$, while it is expected to occur in a random
text R with probability close to 0.

Lemma 6. The probability that a word in D' is a subsequence of R is
$\approx n 4^{-c'L} = \frac{1}{n^{c'/H_S - 1}}$.

Proof. Let R denote a random “distracter” sequence of length n over the alphabet
of the 4 DNA bases. Since all sequences are equiprobable, one of length
$c'L = \frac{c' \log_4 n}{H_S}$ is likely to occur with probability $4^{-c'L} = \frac{1}{n^{c'/H_S}}$. Since
it can occur at any of $\approx n$ locations in R, the probability of it occurring in R is
$\approx n 4^{-c'L} = \frac{1}{n^{c'/H_S - 1}}$. By assumption in Lemma 4, $c' > H_S$, so $\frac{c'}{H_S} - 1 > 0$.
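
As a numerical sanity check of this estimate, the sketch below evaluates the direct form n · 4^(−c'L) and the closed form 1 / n^(c'/H_S − 1), taking L = log_4(n) / H_S. The parameter values (an entropy rate of 0.5 and c' = 0.8) are arbitrary illustrative choices that satisfy c' > H_S, and the function name is hypothetical.

```python
import math

def distracter_hit_probability(n, entropy_rate, c_prime):
    """Approximate chance that a fixed word of length c'L occurs in random DNA."""
    L = math.log(n, 4) / entropy_rate
    direct = n * 4.0 ** (-c_prime * L)
    closed_form = 1.0 / n ** (c_prime / entropy_rate - 1.0)
    return direct, closed_form

for n in (10**3, 10**6):
    print(n, distracter_hit_probability(n, entropy_rate=0.5, c_prime=0.8))
```

Both expressions agree, and the value shrinks as n grows, which is why a dictionary word is unlikely to appear in a random distracter strand.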

Lemma 7. If DNA-SIEVE operates on tube T and results in tube F(T), then at
most $\sigma^-$ of the distracters in T are in F(T), while at least $\approx (1 - \sigma^+)(1 - \frac{1}{e})$
of the plaintext strands of T are in F(T).

Proof. The probability that a distracter strand in T is present in F(T) is limited
by $\sigma^- + \frac{1}{n^{c'/H_S - 1}}$, the sum of the error and the theoretical chance. By assumption,
the error rate is $\le \sigma^-$. By Lemma 6 the chance is $\le \frac{1}{n^{c'/H_S - 1}}$. Since n is large,
this is $\approx 0$. Therefore at most $\sigma^-$ of the distracters in T reach F(T). Similarly,
by Lemma 5, at least $1 - \frac{1}{e}$ of the plaintext strands in T are expected to be in
F(T). By assumption, at most $\sigma^+$ of the strands that should reach F(T) are left
behind due to separation error. Therefore, $\approx (1 - \sigma^+)(1 - \frac{1}{e})$ of the plaintext
strands actually reach F(T).

Lemma 8. The probability distribution of the distracter strands returns to the
original one after an expected number of $\frac{2L}{p}$ iterations of DNA-SIEVE.

Proof. Each iteration of DNA-SIEVE uses a word w from D' that has not been
previously used. Assume w is a prefix or suffix of another word w' in D'. Once
DNA-SIEVE has been applied using w, the probability distribution of the distracters
complementary to w' will be altered. The distribution of the rest will remain
unaffected. There are $\le \frac{2L}{p}$ words that can overlap with a given word from D'.
Therefore, a particular separation affects $\le \frac{2L}{p}$ other iterations. If w is chosen
randomly from D', then after an expected number of $\frac{2L}{p}$ iterations, all the strands
will be equally affected and hence the probability distribution of the distracters
will be the same as before the sequence of iterations. Such a number of iterations
is termed “independent”.



Theorem 1. To reduce the ratio of distracter to plaintext strands by a factor r,
it suffices to apply DNA-SIEVE an expected number of $O(\log n) \log r$ times.

Proof. Denote the ratio of the distracter strands to the plaintext strands in test
tube T with r'. Now consider a tube F(T) that results from applying DNA-SIEVE
t times till this ratio has been reduced by a factor r. Denote the ratio
for tube F(T) by r''. By Lemma 8, after an expected number of $\frac{2L}{p}$ iterations,
a test tube G(T) is produced with the same distribution of distracters as in T.
Applying Lemma 7, we expect that after every set of $\frac{2L}{p}$ iterations, the ratio will
change by at least $\rho = \frac{\sigma^-}{(1 - \sigma^+)(1 - \frac{1}{e})}$. We expect a decrease in the concentration
after t iterations by a factor of $\rho^{t / (2L/p)}$. To attain a decrease of $r$ we need
$t = \frac{2L}{p} \cdot \frac{\log r}{\log (1/\rho)}$. Since $L = O(\log n)$ and $\rho = O(1)$, $t = O(\log n) \log r$.

4.6 DNA-SIEVE Implementation Considerations 

The theoretical analysis of DNA-SIEVE was used to justify the expected geometric
decrease in the concentration of the distracter strands. It also provides
two further insights. The number of “plaintext” DNA strands may decrease by
a factor of $(1 - \sigma^+)(1 - 1/e)$ after each iteration. It is therefore prudent to
increase the absolute number of copies periodically (by ligating all the strands
with primer sequences, PCR cycling, and then digesting the primers that were
added). The number of iterations that can be performed is limited to n' due to 
the fact that a distinct word from D' must be used each time. This limits the 
procedure to operation on a population where the number of distracter strands 
is < 4A' . 
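
The per-iteration factors discussed above can be turned into a small numerical model. The sketch below assumes a single error rate for both kinds of separation error, multiplies the distracter count by that rate, and multiplies the message count by (1 − rate)(1 − 1/e) at each independent iteration. The starting counts in the usage lines are arbitrary placeholders, not values taken from the paper, and the function names are hypothetical.

```python
import math

def sieve_counts(distracters, messages, error_rate, iterations):
    """Expected strand counts after each independent DNA-SIEVE iteration."""
    keep_message = (1.0 - error_rate) * (1.0 - 1.0 / math.e)
    history = []
    for _ in range(iterations):
        distracters *= error_rate        # distracters carried along by error
        messages *= keep_message         # message strands that survive
        history.append((distracters, messages))
    return history

def iterations_to_cross(distracters, messages, error_rate, limit=1000):
    """First iteration at which distracters drop below message strands, if any."""
    counts = sieve_counts(distracters, messages, error_rate, limit)
    for i, (d, m) in enumerate(counts, start=1):
        if d < m:
            return i
    return None

# Placeholder starting populations, chosen only to illustrate the trend.
for rate in (0.05, 0.10, 0.20):
    print(rate, iterations_to_cross(1e8, 1e3, rate))
```

In this model the message population also shrinks geometrically, so a lower error rate both speeds up the crossover and leaves more message strands at the end, which is why periodic amplification is suggested above.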



4.7 Empirical Analysis 

We performed an empirical analysis of DNA-SIEVE. We assumed that the test 
tube would start with 10® distracter strands and 10® message strands. The first
parameter varied was the number of “independent” iterations of DNA-SIEVE,
from 1 to 10. The second parameter varied was the separation error rate, from
0.01 to 0.25 in multiples of 0.01. Here we do not assume a difference in the error
rate for false positives and false negatives that occur during the separation. In
each case, the number of distracters and message strands remaining was computed.
The results are plotted in Figure 7. From this we can see that 5 to 10
iterations of DNA-SIEVE suffice to reduce the distracter population's size to
below that of the message strands when the separation error is < 0.18. The
table illustrates the number of distracters and message strands left after 3, 6
and 9 iterations with varying separation error rates. If the error rate is
reasonable, it can be seen from the table that there remain enough message strands
for the plaintext to be found. If the separation error rate is high, the number
of strands used must be increased to allow enough iterations of DNA-SIEVE to
occur before the message strands are all lost.

Fig. 7. Simulation of DNA-SIEVE: Distracters have a sharper drop-off in
concentration

4.8 Improving DNA Steganography

We can improve a DNA steganography system by reducing the difference be- 
tween the plaintext and distracter strands. This can be done by making the 
distracters similar to the plaintext by creating them using random assembly of 
elements from the dictionary D. Alternatively, DNA-SIEVE can be employed 
on a set of random distracters to shape the population into one whose distri- 
bution matches that of the plaintext. We note, however, that if the relative 
entropy [9] between the plaintext and the distracter strand populations is large 
enough, DNA-SIEVE can be employed as previously described. An attacker can 
use a larger dictionary, which provides a better model of the plaintext source, 
to increase the relative entropy re-enabling the use of DNA-SIEVE. If the M 
plaintext strands are tagged with a sequence that allows them to be extracted, 
then they may be recognized by the high frequency of the tag sequence in the 
population. To guard against this, N sets of M strands each can be mixed in. 
This results in a system that uses $V = O(MN)$ volume. To prevent a brute
force attack, N must be large, potentially detracting from the practicality of
using the DNA steganographic system.

The other approach to reduce the distinguishability of the plaintext from the 
distracters is to make the former mimic the latter. By compressing the plaintext 
with a lossless algorithm, such as Lempel-Ziv [51], the relative entropy of the 
message and the distracter populations can be reduced. If the plaintext is derived 
from an electronic source, it can be compressed in a preprocessing step. If the 
source is natural DNA, it can be recoded using a substitution method similar to 
the one described in Section 2. However, the security of such a recoding remains 
unknown. In the case of natural DNA, for equivalent effort, DNA cryptography 
offers a more secure alternative to DNA steganography. 

5 Conclusion 

This paper presented an initial investigation into the use of DNA-based
information security. We discussed two classes of methods: (i) DNA cryptography
methods based on DNA one-time-pads, and (ii) DNA steganography methods.
Our DNA substitution and XOR methods are based on one-time-pads, which are
in principle unbreakable. We described our implementation of DNA cryptography
with 2D input/output. We showed that a class of DNA steganography methods
offers limited security and can be broken with a reasonable assumption about the
entropy of the plaintext messages. We considered modified DNA steganographic 
systems with improved security. Steganographic techniques rest on the assump- 
tion that the adversary is unaware of the existence of the data. When this does 
not hold, DNA cryptography must be relied upon.



Abstract. We consider the result of a wet splicing procedure after the
reaction has run to its completion, or limit, and we try to describe 
the molecules that will be present at this final stage. In language theo- 
retic terms the splicing procedure is modeled as an H system, and the 
molecules that we want to consider correspond to a subset of the splicing 
language which we call the limit language. We give a number of examples, 
including one based on differential equations, and we propose a definition 
for the limit language. With this definition we prove that a language is 
regular if and only if it is the limit language of a reflexive and symmetric 
splicing system. 



1 Introduction 

Tom Head [5] invented the notion of a splicing system in 1987 in an attempt to 
model certain biochemical reactions involving cutting and reconnecting strands 
of double-sided DNA. Since then there have been many developments and exten- 
sions of the basic theory, but some of the basic questions have remained unan- 
swered. For example, there is still no simple characterization of the languages 
defined by splicing systems.

The language defined by a splicing system is called the splicing language,
and this corresponds to the set of molecular types that are created during the 
evolution of the system. In discussions several years ago Tom proposed analyzing 
the outcome of a splicing system in a way that is closer to what is actually 
observed in the laboratory. The idea is that certain molecules which appear 
during the biochemical reactions are transient in the sense that they are used 
up by the time the splicing experiment has “run to completion.” The molecules 
that are left at this stage can be termed the limit molecules of the system, and 
the corresponding formal language defined by the splicing system is called the 
limit language. 

The first major problem is to properly define this limit language. One pos- 
sibility is to try to use differential equations to model the set of molecules, and 
then to define the limit language in terms of the solutions. Although it is reasonably
clear how to set up such equations, solving them is another matter, since
the system of differential equations is non-linear and, in non-trivial cases, has
infinitely many variables. 

Another possibility is to try to avoid quantitative issues and to analyze the 
splicing process qualitatively, in the context of formal languages, but guided by 
our understanding of the actual behavior of the biochemical systems. 

In this paper we present a number of examples from the second viewpoint, 
illustrating some of the features that we expect in a limit language. We also give 
some of the details of a simple example from the differential equations viewpoint, 
to illustrate the general approach. 

We then propose a definition of limit language in the formal language setting. 
It is not clear that our definition covers all phenomena of limit behavior, but 
it is sufficient for the range of examples we present, and leads to some interest- 
ing questions. It is well known that splicing languages (based on finite splicing 
systems) are regular, and it is natural to ask whether the same is true for limit 
languages. We do not know the answer in general, but in the special case of a 
reflexive splicing system we can give a satisfying answer: The limit language is 
regular; and, moreover, every regular language occurs as a limit language. 

We present this paper not as a definitive statement, but as a first step toward 
understanding the limit notion. 

2 Motivation 

We review briefly the basic definitions; for more details see [4]. A splicing rule 
is a quadruple $r = (u_1, v_1; u_2, v_2)$ of strings, and it may be used to splice strings
$x_1u_1v_1y_1$ and $x_2u_2v_2y_2$ to produce $x_1u_1v_2y_2$. A splicing scheme or H scheme is
a pair $\sigma = (A, R)$ where $A$ is the alphabet and $R$ is a finite set of rules over $A$.
A splicing system or H system is a pair $(\sigma, I)$ where $\sigma$ is a splicing scheme and
$I \subseteq A^*$ is a finite set, the initial language. The splicing language defined by such
a splicing system is the smallest language $L$ that contains $I$ and is closed under
splicing using rules of $\sigma$. Usually $L$ is written as $\sigma^*(I)$.

Associated to a rule $r = (u_1, v_1; u_2, v_2)$ are three other rules: its symmetric
twin $\bar{r} = (u_2, v_2; u_1, v_1)$ and its reflexive restrictions $\dot{r} = (u_1, v_1; u_1, v_1)$ and
$\ddot{r} = (u_2, v_2; u_2, v_2)$. A splicing scheme $\sigma$ is called symmetric if it contains the
symmetric twins of its rules, and it is called reflexive if it contains the reflexive 
restrictions of its rules. 
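
The definitions above can be made concrete with a short sketch. The following Python fragment is an illustration only and is not part of the formal development: the names `splice`, `close_rules` and `splicing_language`, and the length bound `max_len`, are assumptions introduced here. Since a splicing language may be infinite, the closure is truncated to words of length at most `max_len`, so in general it yields only a lower approximation of the splicing language.

```python
from itertools import product

def splice(w1, w2, rule):
    """All words x1 u1 v2 y2 obtained from w1 = x1 u1 v1 y1 and w2 = x2 u2 v2 y2."""
    u1, v1, u2, v2 = rule
    site1, site2 = u1 + v1, u2 + v2
    results = set()
    for i in range(len(w1) - len(site1) + 1):
        if not w1.startswith(site1, i):
            continue
        left = w1[: i + len(u1)]                      # x1 u1
        for j in range(len(w2) - len(site2) + 1):
            if w2.startswith(site2, j):
                results.add(left + w2[j + len(u2):])  # v2 y2
    return results

def close_rules(rules):
    """Add the symmetric twin and both reflexive restrictions of every rule."""
    closed = set()
    for u1, v1, u2, v2 in rules:
        closed |= {(u1, v1, u2, v2), (u2, v2, u1, v1),
                   (u1, v1, u1, v1), (u2, v2, u2, v2)}
    return closed

def splicing_language(initial, rules, max_len):
    """Closure of `initial` under splicing, truncated to words of length <= max_len."""
    language = set(initial)
    while True:
        new = set()
        for w1, w2 in product(language, repeat=2):
            for rule in rules:
                for z in splice(w1, w2, rule):
                    if len(z) <= max_len and z not in language:
                        new.add(z)
        if not new:
            return language
        language |= new

rules = close_rules([("a", "b", "c", "d")])
print(sorted(splicing_language({"ab", "cd"}, rules, max_len=4)))
# → ['ab', 'ad', 'cb', 'cd']
```

For the initial language {ab, cd} and the rule (a, b; c, d) with its symmetric and reflexive versions, the closure is finite, so the truncation has no effect in this case.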

We consider $I$ as modeling an initial set of DNA molecule types, with $\sigma$ modeling
a test-tube environment in which several restriction enzymes and ligases act
simultaneously. In this case the language L models the set of all molecule types 
which appear at any stage in this reaction. In this interpretation we can see that, 
in fact, representatives of all molecule types in L appear very quickly (although 
perhaps in very small concentrations) after the beginning of the reactions. 

Here we want to consider a somewhat different scenario: We ask not for the set 
of all molecules that are created in this reaction, but for the set of molecules that 
will be present after the system has “reached equilibrium”. In other words, we
expect that some of the molecular types will be used up in constructing different
molecular types, so they will eventually disappear from the solution. We call the
subset $L_\infty$ of $L$ corresponding to the remaining “final-state” molecules the limit
set of the splicing system. In this section we consider several examples of such 
limit sets, without giving a formal definition. 

Example 1. Initial language $I = \{ab, cd\}$, rule $r = (a, b; c, d)$, together with its
symmetric and reflexive versions. Then the splicing language is $L = \{ab, cd, ad, cb\}$
but the limit language is $L_\infty = \{ad, cb\}$.

Discussion. One of the first wet-lab experiments validating the splicing model
was described in [7]. This example is just the symbolic version of that experiment.
The molecules ab and cd are spliced by rules $r$ and $\bar{r}$ to produce ad and cb; they
are also spliced to themselves by rules $\dot{r}$ and $\ddot{r}$ to regenerate themselves. The
words ad and cb are “inert” in the sense that they do not participate in any
further splicings, and any inert words will obviously appear in the limit language.
However, initial stocks of the initial strings ab and cd will eventually be used
up (if they are present in equal numbers), so they will not appear in the limit
language, and the limit language should be just $\{ad, cb\}$. We call the words that
eventually disappear “transient”. Of course, if there are more molecules of ab
than of cd present at the start then we should expect that there will also be 
molecules of ab left in the final state of the reaction. 

Our definition of limit language will ignore this possibility of unbalanced 
numbers of molecules available for various reactions. Put simply, we will define 
the limit language to be the words that are predicted to appear if some amount 
of each initial molecule is present, and sufficient time has passed for the reaction 
to reach its equilibrium state, regardless of the balance of the reactants in a 
particular experimental run of the reaction. In the example above, one particular 
run of the reaction may yield molecules of ab at equilibrium, while another run 
may not yield molecules of ab at equilibrium, but rather molecules of cd. Both 
reactions, however, are predicted to yield molecules of ad and cb, and hence these
molecules constitute the limit language. 

Limit languages would be easy to analyze if they consisted only of inert 
strings. The following example involves just a slight modification to the previous 
example and provides a limit language without inert words. 

Example 2. Initial language $I = \{ab, cd\}$, rules $r_1 = (a, b; c, d)$ and $r_2 =
(a, d; c, b)$, together with their symmetric twins and reflexive restrictions. Then
there are no inert words and the splicing language and limit language are the
same, $L = L_\infty = \{ab, cd, ad, cb\}$.

Discussion. The molecules ab and cd are spliced by rule $r_1$ and its symmetric
twin as in Example 1, producing ad and cb. These products in turn can be
spliced using $r_2$ and its symmetric twin to produce ab and cd. As in Example 1
each molecule can regenerate itself using one of the reflexive rules. None of the
strings are inert, nor do any of them disappear. Thus each molecule in $L_\infty$ is
considered to be in a reactive “steady-state” at equilibrium. Their distribution
can be calculated using differential equations that model the chemical reactions
for the restriction enzymes, ligase, and initial molecules involved. 

The next example demonstrates the possibility of an infinite number of re- 
active steady state molecules at equilibrium. 

Example 3. Initial language $I = \{aa, aaa\}$, rule $r = (a, a; a, a)$. Then there are
no inert words and the splicing language and the limit language are the same,
$L = L_\infty = aa^+$.

Discussion. It is easy to see that $L = aa^+$. Moreover, each string in $L$ is the
result of splicing two other strings of $L$; in fact, there are infinitely many pairs
of strings in $L$ which can be spliced to regenerate it. In effect, all copies of the
symbol a are shuffled between the various molecular types in $L$. We consider
each word of L to be in the limit language. This interpretation is buttressed 
by the example in Section 3, which calculates the limiting concentrations of the 
molecules in a very similar system and finds that all limiting concentrations are 
positive. 

Our definition of limit language will avoid the detailed calculation of limiting 
distribution; in cases like this we will be content to note that any molecule 
will be present in the limit. 

The following illustrates a very different phenomenon: Molecules disappear 
by growing too big. 

Example 4. Initial language $I = \{abc\}$, splicing rule $r = (b, c; a, b)$. Then the
splicing language is $L = ab^+c$ but the limit language is empty.

Discussion. The calculation of the splicing language is straightforward; note
that splicing $ab^jc$ and $ab^kc$ produces $ab^{j+k}c$. Hence abc is not the result of any
splicing operation although it is used in constructing other molecules. Therefore
all molecules of type abc will eventually be used up, so abc cannot appear in
the limit language. But now, once all molecules of type abc have disappeared,
then there is no way to recreate $ab^2c$ and $ab^3c$ by splicing the remaining strings
$ab^kc$, $k \geq 2$. Hence all molecules of types $ab^2c$ or $ab^3c$ will eventually be used
up, so they cannot appear in the limit language. The remainder of $L$ is analyzed
similarly, using induction. 

The H scheme in Example 4 is neither reflexive nor symmetric, so it is hard to 
justify this example as a model of an actual chemical process. In fact, the next 
example shows that this phenomenon, in which molecules disappear by “converging
to infinitely long molecules”, can also happen in reflexive and symmetric
H systems. However, we shall see later (Corollary 1) that in the reflexive case 
the limit language is not empty unless the initial language is empty. 

Example 5. Initial language $I = \{abc\}$, splicing rule $r = (b, c; a, b)$ together
with its symmetric twin and its reflexive restrictions. Then the splicing language
is $L = ab^*c$ and the limit language is ac.

Discussion. The calculation of the splicing language follows as in Example 4,
and all molecules of type $ab^kc$ for $k > 0$ will again eventually disappear under
splicing using $r$. The symmetric splice using $\bar{r}$ generates the single inert string
ac, which will eventually increase in quantity until it consumes virtually all of 
the a and c pieces. Thus this system involves an infinite set of transient strings 
and a finite limit language consisting of a single inert string. 

This system could run as a dynamic wet lab experiment, and one would 
expect to be able to detect by gel electrophoresis an increasingly dark band 
indicating the presence of more and more ac over time. Over the same time 
frame we predict the strings of the type $ab^kc$ would get very long, and this would
eventually cause these molecules to remain at the top of the gel. In theory, at 
the final state of the reaction, there would be one very long such molecule that 
consumed all of the b pieces. We might expect the dynamics of the system to 
make it increasingly difficult for molecules of $ab^kc$ to be formed as $k$ gets very
large. A wet splicing experiment designed to demonstrate the actual dynamics 
of such a system would be a logical step to test the accuracy of this model of 
limit languages. 



It is essential for the dynamic models that are introduced in Section 3 
that we use a more detailed version of splicing. The action of splicing two 
molecules represented as $w_1 = x_1u_1v_1y_1$ and $w_2 = x_2u_2v_2y_2$ using the splicing
rule $(u_1, v_1; u_2, v_2)$ to produce $z = x_1u_1v_2y_2$ is actually a three-step process.

A restriction enzyme first acts on $w_1$, cutting it between $u_1$ and $v_1$. The resulting
fragments have “sticky ends” corresponding to the newly cut ends, in the sense
that a number of the bonds between base pairs at the cut site are broken, so part
of the cut end on one side is actually single stranded, with the complementary
bases on the other fragment. Similarly the molecule $w_2$ is cut between $u_2$ and
$v_2$, generating two fragments with sticky ends. Finally, if the fragments corresponding
to $x_1u_1$ and $v_2y_2$ meet in the presence of a ligase, they will join at the
sticky ends to form z. It is clear that any dynamic analysis of the situation must 
account for the various fragments, keeping track of sticky ends. 

There are two primary techniques for this: cut-and-paste systems as defined 
by Pixton [8], and cutting/recombination systems as defined by Freund and his 
coworkers [2]. The two approaches are essentially equivalent and, in this context,
differ only in how they handle the “sticky ends”.

We shall use cut-and-paste. In this formulation there are special symbols 
called “end markers” which appear at the ends of all strings in the system 
and are designed to encode the sticky ends, as well as the ends of “complete” 
molecules. The cutting and pasting rules are themselves encoded as strings with 
end markers. A cutting rule $\alpha z\beta$ encodes the cutting action $xzy \Rightarrow x\alpha + \beta y$,
while a pasting rule $\alpha w\beta$ corresponds to the pasting action $x\alpha + \beta y \Rightarrow xwy$.
Thus a splicing rule $(u_1, v_1; u_2, v_2)$ can be represented by cutting rules $\alpha_1u_1v_1\beta_1$
and $\alpha_2u_2v_2\beta_2$ and a pasting rule $\alpha_1u_1v_2\beta_2$. In this case $\alpha_1$ encodes the sticky
end on the left half after cutting between $u_1$ and $v_1$ and $\beta_2$ corresponds to the
sticky end on the right half after cutting between $u_2$ and $v_2$, so the pasting rule
reconstitutes $u_1v_2$ when the sticky ends $\alpha_1$ and $\beta_2$ are reattached. For details
see [8]. 

Example 6. This is a cut-and-paste version of Example 5. The set of end markers
is $\{\alpha, \beta_1, \beta_2, \gamma, \delta\}$, the set of cutting rules is $C = \{\alpha ab\beta_1, \beta_2bc\gamma\}$, the set of
pasting rules is $P = C \cup \{\beta_2bb\beta_1, \alpha ac\gamma\}$, and the initial set is $\{\delta abc\delta\}$. Then
the full splicing language is $L = \delta ab^*c\delta + \delta\alpha + \gamma\delta + \delta ab^*\beta_2 + \beta_1b^*c\delta + \beta_1b^*\beta_2$,
and the limit language is $L_\infty = \{\delta ac\delta\}$.

Discussion. $\delta ac\delta$ is inert and is created from $\delta\alpha$ and $\gamma\delta$ by pasting. Thus the
strings $\delta\alpha$ and $\gamma\delta$ will be eventually used up. These strings are obtained by
cutting operations on the strings in $\delta ab^+c\delta + \delta ab^*\beta_2 + \beta_1b^*c\delta$, so these in turn
are transient. The remaining strings are in $\beta_1b^*\beta_2$, and these are never cut but
grow in length under pasting using the rule $x\beta_2 + \beta_1y \Rightarrow xbby$. Hence these
strings disappear in the same fashion as in Example 4. 
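
The cut-and-paste system of Example 6 is small enough to explore mechanically. The sketch below is only an illustration under stated assumptions: molecules are modeled as triples (left end marker, body, right end marker), the end markers are given the ASCII names "A", "B1", "B2", "G", "D" (standing for α, β1, β2, γ, δ), and the function names and the bound `max_len` on the body length are hypothetical choices made here so that the closure stays finite.

```python
def cut(mol, rule):
    """Cutting rule (a, z, b): x z y => x a + b y, at every occurrence of z."""
    left, body, right = mol
    a, z, b = rule
    pieces = []
    for i in range(len(body) - len(z) + 1):
        if body.startswith(z, i):
            pieces.append(((left, body[:i], a), (b, body[i + len(z):], right)))
    return pieces

def paste(m1, m2, rule):
    """Pasting rule (a, w, b): x a + b y => x w y, or None if the ends do not match."""
    a, w, b = rule
    if m1[2] == a and m2[0] == b:
        return (m1[0], m1[1] + w + m2[1], m2[2])
    return None

def closure(initial, cut_rules, paste_rules, max_len):
    """All molecules reachable by cutting and pasting, with bodies of length <= max_len."""
    molecules = set(initial)
    while True:
        new = set()
        for m in molecules:
            for rule in cut_rules:
                for p1, p2 in cut(m, rule):
                    new.update((p1, p2))
        for m1 in molecules:
            for m2 in molecules:
                for rule in paste_rules:
                    z = paste(m1, m2, rule)
                    if z is not None and len(z[1]) <= max_len:
                        new.add(z)
        if new <= molecules:
            return molecules
        molecules |= new

CUT = [("A", "ab", "B1"), ("B2", "bc", "G")]
PASTE = CUT + [("B2", "bb", "B1"), ("A", "ac", "G")]
language = closure({("D", "abc", "D")}, CUT, PASTE, max_len=6)

inert = ("D", "ac", "D")
print(inert in language, all(cut(inert, r) == [] for r in CUT))
# → True True
```

The bounded closure contains the inert molecule δacδ, which no cutting rule applies to, together with fragments such as β1bbβ2 that carry two sticky ends and can only grow by pasting.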

One possibility that we have not yet discussed is that a DNA fragment may 
produce a circular string by self pasting, if both of its ends are sticky and there 
is an appropriate pasting rule. We use ${}^{\wedge}w$ to indicate the circular version of the
string w. 

Example 7. If we permit circular splicing in Example 6, then the limit language 
is $L_\infty = \{\delta ac\delta\} \cup \{{}^{\wedge}b^k : k \geq 2\}$.

Discussion. The only strings that can circularize are in $\beta_1b^*\beta_2$, and if we self-paste
$\beta_1b^j\beta_2$ using the pasting rule $\beta_2bb\beta_1$, then we obtain ${}^{\wedge}b^{j+2}$. These circular
strings cannot be cut, so they are inert. 

Our final examples illustrate somewhat more complex dynamics. 

Example 8. Initial language $I = \{abb, cdd\}$ and splicing rules $r_1 = (ab, b; a, b)$,
$r_2 = (c, d; a, b)$, $r_3 = (ad, d; a, d)$, together with their symmetric and reflexive
counterparts. The splicing language is $L = ab^+ + ad^+ + cb^+ + cdd$ and the limit
language is $L_\infty = ad^+ + cb^+$.

Discussion. Rule $r_1$ is used to “pump” $ab^+$ from abb while rule $\bar{r}_1$ produces ab
from two copies of abb. Rules $r_2$ and $\bar{r}_2$ splice copies of cdd and $ab^k$ to produce
$cb^k$ and add. Rule $r_3$ generates $add^+$ from add, while $\bar{r}_3$ generates ad from
two copies of add. The transient words cdd and $ab^k$ are used up to produce
add and $cb^+$, and the inert words $cb^+$ are limit words. However all words of
$ad^+$ are continually recreated from other words in $ad^+$ using rules $r_3$, $\bar{r}_3$ and
their reflexive counterparts (in the same manner as in Example 3), so these are 
also limit words. Thus, in this example there are infinitely many inert strings, 
infinitely many steady-state strings, and infinitely many transient strings. 

It is not hard to extend this example to have several “levels” of transient 
behavior. 




Splicing to the Limit 




Example 9. The splicing rules are

$r_1 = (a, b; c, d)$, $r_2 = (a, d; f, b)$, $r_3 = (a, e; a, e)$, $r_4 = (a, f; a, f)$

together with their symmetric twins and reflexive restrictions, and the initial
language $I$ consists of

$w_1 = cd$, $w_2 = ad$, $w_3 = afd$, $w_4 = afbae$, $w_5 = cbae$, $w_6 = abae$.

Then the splicing language $L$ is the same as $I$, but the limit language is
$\{w_3, w_4, w_5, w_6\}$.

Discussion. We shall use a notation that will be made more precise later: For $w$
and $z$ in $L$ we write $w \to z$ to mean that a splicing operation involving $w$ produces
$z$. It is easy to check that there is a cycle $w_2 \to w_3 \to w_4 \to w_5 \to w_6 \to w_2$ so
all these words would seem to be “steady-state” words. On the other hand, $w_1$
can be generated only by splicing $w_1$ and $w_1$ using $\ddot{r}_1 = (c, d; c, d)$, but $w_1$ also
participates in splicing operations that lead to the $w_2$-$w_6$ cycle. For example,
$w_6$ and $w_1$ splice using $r_1$ to produce $w_2$, so $w_1 \to w_2$.

Clearly then $w_1$ will be eventually used up, and so it should not be part of
the limit language. We let $L_1 = L \setminus \{w_1\}$, and reconsider the limit possibilities.
There are only two ways to produce $w_2$ by splicing: from $w_2$ and $w_2$ using
$\dot{r}_2 = (a, d; a, d)$ or from $w_6$ and $w_1$ using $r_1$. Since $w_1$ is not available in $L_1$
we see that the $w_2$-$w_6$ cycle is broken. In fact, since $w_2$ is used in the splicing
$w_2 \to w_3$, it must eventually disappear, so it is not part of the limit language.
The remaining words do not form a cycle but they are connected as follows:
$w_3 \to w_4 \to w_3$ using $r_4$ and $w_4 \to w_5 \to w_6 \to w_4$ using $r_3$. Thus these words
appear to be limit words; and, since the splicings in these cycles do not involve
$w_2$, they remain limit words after $w_2$ is removed.

With somewhat more work we can construct a splicing language in which 
the words $w_k$ are replaced by infinite sublanguages $W_k$, but demonstrating the
same phenomenon: The disappearance of a set of transients like $W_1$ can change
the apparent limit behavior of the remaining strings. 

3 Dynamics 

We present here a dynamical model based on a very simple cut-and-paste system. 
The molecules all have the form $F^j$ for $j > 0$, where $F$ is a fixed fragment of
DNA. We suppose that our system contains a restriction enzyme which cuts
between two copies of $F$ and a ligase which joins any two molecules at their
ends. That is, the simplified cut-and-paste rules are just

$$F^{i+j} \Rightarrow F^i + F^j, \qquad F^i + F^j \Rightarrow F^{i+j}.$$

For simplicity we do not allow circular molecules.

The dynamics of our model will be parameterized by two rates, $\alpha$ and $\beta$. We
interpret $\alpha$ as the probability that a given ordered pair $A$, $B$ of molecules will
paste to yield $AB$ in a unit of time, and we interpret $\beta$ as the probability that a
cut operation will occur at a given site on a given molecule $A$ in a unit of time.

Let $S_k = S_k(t)$ be the set of molecules of type $F^k$ at time $t$, let $N_k = N_k(t)$
be the number of such molecules (or the concentration, i.e., the number per
unit volume), and let $N = N(t) = \sum_k N_k(t)$ be the total number of molecules.

Consider the following four contributions to the rate of change of $N_k$ in a small
time interval of length $\Delta t$:

First, a certain number will be pasted to other molecules. A single molecule
$A$ of type $F^k$ can be pasted to any other molecule $B$ in two ways (yielding either
$AB$ or $BA$), so the probability of pasting $A$ and $B$ is $2\alpha\Delta t$. Since there are $N - 1$
other molecules the probability that $A$ will be pasted to some other molecule is
$2\alpha(N - 1)\Delta t$, and, since this operation removes $A$ from $S_k$, the expected change
in $N_k$ due to pastings with other molecules is $-2\alpha(N - 1)N_k\Delta t$. Since $N$ is
very large we shall approximate this as $-2\alpha NN_k\Delta t$. Note that this may seem to
overestimate the decrease due to pastings when both $A$ and $B$ are in $S_k$, since
both $A$ and $B$ will be considered as candidates for pasting; but this is correct,
since if two elements of $S_k$ are pasted, then $N_k$ decreases by 2, not by 1.

Second, pasting operations involving smaller molecules may produce new
elements of $S_k$. For molecules in $S_i$ and $S_j$ where $i + j = k$ the same reasoning as
above produces $\alpha N_iN_j\Delta t$ new molecules in $S_k$. There is no factor of 2 here since
the molecule from $S_i$ is considered to be pasted on the left of the one from $S_j$.
Also, if $i = j$, then we actually have $\alpha(N_i - 1)N_i\Delta t$ because a molecule cannot
paste to itself, and we approximate this as $\alpha N_i^2\Delta t$. This is not as easy to justify
as above, since $N_i$ is not necessarily large; however, this approximation seems to
be harmless. The total corresponding change in $N_k$ is $\alpha\sum_{i+j=k} N_iN_j\Delta t$.

Third, a certain number of the molecules in $S_k$ will be cut. Each molecule in
$S_k$ has $k - 1$ cutting sites, so there are $(k - 1)N_k$ cutting sites on the molecules
of $S_k$, so we expect $N_k$ to change by $-\beta(k - 1)N_k\Delta t$.

Finally, new molecules appear in $S_k$ as a result of cutting operations on
longer molecules, and, since $\Delta t$ is a very small time interval, we do not consider
multiple cuts on the same molecule. If $A$ is a molecule in $S_m$, then there are
$m - 1$ different cutting sites on $A$, and the result of cutting at the $j$th site is
two fragments, one in $S_j$ and the other in $S_{m-j}$. Hence, if $m > k$, exactly two
molecules in $S_k$ can be generated from $A$ by cutting, and the total expected
change in $N_k$ due to cutting molecules in $S_m$ is $2\beta N_m\Delta t$. Summing these gives
a total expected change in $N_k$ of $2\beta\sum_{m>k} N_m\Delta t$.

So we have the following basic system of equations: 

$$N_k' = -2\alpha NN_k - \beta(k - 1)N_k + \alpha\sum_{i+j=k} N_iN_j + 2\beta\sum_{m>k} N_m. \tag{1}$$

Define $M = \sum_k kN_k$. If $\mu$ is the mass of a single molecule in $S_1$, then $\mu M$
represents the total mass of DNA, so $M$ should be a constant. We verify this
from (1) as a consistency check:

We will ignore all convergence questions. We have $M' = \sum_k kN_k'$. Plugging
in (1), we have four sums to consider, as follows:

$$\begin{aligned}
-2\alpha\sum_k kNN_k &= -2\alpha MN, \\
\alpha\sum_k\sum_{i+j=k} kN_iN_j &= \alpha\sum_{i,j}(i + j)N_iN_j = 2\alpha MN, \\
-\beta\sum_k k(k - 1)N_k &= -\beta\sum_m m(m - 1)N_m, \\
2\beta\sum_k\sum_{m>k} kN_m &= 2\beta\sum_m\Bigl(\sum_{k<m} k\Bigr)N_m = \beta\sum_m m(m - 1)N_m.
\end{aligned}$$

Adding these gives M' = 0. 

Summing the equations (1) to get an equation for N yields four sums, which 
we treat similarly: 

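$$\begin{aligned}
-2\alpha\sum_k NN_k &= -2\alpha N^2, \\
\alpha\sum_k\sum_{i+j=k} N_iN_j &= \alpha\sum_{i,j} N_iN_j = \alpha N^2, \\
-\beta\sum_k (k - 1)N_k &= -\beta(M - N), \\
2\beta\sum_k\sum_{m>k} N_m &= 2\beta\sum_m\Bigl(\sum_{k<m} 1\Bigr)N_m = 2\beta(M - N).
\end{aligned}$$

Hence

$$N' = -\alpha N^2 - \beta N + \beta M. \tag{2}$$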

Equations of this form are solved explicitly in elementary differential equation 
texts; in this case the solution is 



$$N = \bar{N} + \frac{\gamma Ce^{-\gamma t}}{1 - \alpha Ce^{-\gamma t}}, \tag{3}$$

where $\gamma = \sqrt{\beta^2 + 4\alpha\beta M}$, $\bar{N}$ is the positive solution of

$$-\alpha\bar{N}^2 - \beta\bar{N} + \beta M = 0, \tag{4}$$

and $C$ is determined by the initial conditions. From the solution (3) and a
consideration of the direction field for the differential equation (2) it is clear
that if $N(0) > 0$, then $N(t)$ is defined for all $t > 0$ and $N \to \bar{N}$ as $t \to \infty$.
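
The behavior of the scalar equation $N' = -\alpha N^2 - \beta N + \beta M$ can be checked numerically. The sketch below is an illustration only: it integrates the equation with a classical fourth-order Runge-Kutta step and compares the result with the closed form $N = \bar{N} + \gamma Ce^{-\gamma t}/(1 - \alpha Ce^{-\gamma t})$, where $C$ is fixed by $N(0)$. The function names and the parameter values $\alpha = \beta = 1$, $M = 2$, $N(0) = 2$ are assumptions chosen for the example, not values taken from an experiment.

```python
import math

def equilibrium(alpha, beta, mass):
    """Positive root of -alpha*N^2 - beta*N + beta*M = 0, together with gamma."""
    gamma = math.sqrt(beta ** 2 + 4 * alpha * beta * mass)
    return (gamma - beta) / (2 * alpha), gamma

def closed_form(alpha, beta, mass, n0, t):
    """Explicit solution N(t) with the constant C fixed by N(0) = n0."""
    n_bar, gamma = equilibrium(alpha, beta, mass)
    u0 = n0 - n_bar
    c = u0 / (gamma + alpha * u0)
    e = c * math.exp(-gamma * t)
    return n_bar + gamma * e / (1 - alpha * e)

def rk4(alpha, beta, mass, n0, t_end, steps):
    """Integrate N' = -alpha*N^2 - beta*N + beta*M from 0 to t_end."""
    def f(n):
        return -alpha * n * n - beta * n + beta * mass
    h = t_end / steps
    n = n0
    for _ in range(steps):
        k1 = f(n)
        k2 = f(n + h * k1 / 2)
        k3 = f(n + h * k2 / 2)
        k4 = f(n + h * k3)
        n += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return n

n_bar, gamma = equilibrium(1.0, 1.0, 2.0)
print(n_bar, gamma)
# → 1.0 3.0
```

With these values the numerical solution started at $N(0) = 2$ approaches $\bar{N} = 1$ at the exponential rate $\gamma = 3$, in agreement with the closed form.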

We want to find the limiting values of the quantities $N_k$. First we have a
simple result in differential equations.

Lemma 1. Suppose $a$ and $b$ are continuous functions on $[t_0, \infty)$ and $a(t) \to
\bar{a} > 0$ and $b(t) \to \bar{b}$ as $t \to \infty$. Then any solution of $Y' = -aY + b$ satisfies
$\lim_{t\to\infty} Y(t) = \bar{b}/\bar{a}$.

We omit the proof of the lemma, which uses standard comparison techniques
for solutions of ordinary differential equations. 

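Now we replace $\sum_{m>k} N_m$ with $N - \sum_{i=1}^{k} N_i$ in (1) and rearrange to get

$$\begin{aligned}
N_k' &= -2\alpha NN_k - \beta(k - 1)N_k + \alpha\sum_{i+j=k} N_iN_j + 2\beta\Bigl(N - \sum_{i=1}^{k} N_i\Bigr) \\
&= -(2\alpha N + \beta(k + 1))N_k + \alpha\sum_{i+j=k} N_iN_j + 2\beta\Bigl(N - \sum_{i=1}^{k-1} N_i\Bigr) \\
&= -a_kN_k + b_k.
\end{aligned}$$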



The point of this rearrangement is that the coefficients $a_k$ and $b_k$ only depend on
the functions $N_j$ for $j < k$ and on the known function $N$. Hence, if we solve the
equations in sequence, then the coefficients $a_k$ and $b_k$ can be treated as known
functions of $t$.

It is convenient to write our results in terms of the asymptotic average molecular
size $W = M/\bar{N}$; using (4) we have $\beta W = \alpha\bar{N} + \beta$. Clearly the coefficient
$a_k$ has the limit $\bar{a}_k = 2\alpha\bar{N} + \beta(k + 1)$. We can find the limit $\bar{N}_k$ of $N_k$ from
Lemma 1 once we know the limit $\bar{b}_k$ of $b_k$. Using a routine but messy induction
(which we omit) we show that $\bar{b}_k = \bar{a}_k\bar{N}_k$, where

$$\bar{N}_k = \lim_{t\to\infty} N_k(t) = \frac{\bar{N}}{W}\left(\frac{W - 1}{W}\right)^{k-1}.$$
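
As a sanity check on the limiting values, the following sketch evaluates the geometric distribution $\bar{N}_k = (\bar{N}/W)((W - 1)/W)^{k-1}$ and verifies numerically that it sums to $\bar{N}$, carries total mass $M$, and makes the right-hand side of (1) vanish. It is an illustration only: the function names, the truncation index `k_max` and the parameter values are assumptions made here, and the tail sum over $m > k$ is truncated at `k_max`.

```python
import math

def limiting_distribution(alpha, beta, mass, k_max):
    """Return N_bar and the list [N_bar_1, ..., N_bar_kmax] of limiting values."""
    gamma = math.sqrt(beta ** 2 + 4 * alpha * beta * mass)
    n_bar = (gamma - beta) / (2 * alpha)   # positive root of (4)
    w = mass / n_bar                       # asymptotic average size W
    q = (w - 1) / w
    dist = [n_bar / w * q ** (k - 1) for k in range(1, k_max + 1)]
    return n_bar, dist

def stationary_residual(alpha, beta, n_bar, dist, k):
    """Right-hand side of (1) evaluated at the limiting values (tail truncated)."""
    n_k = dist[k - 1]
    pasted_in = sum(dist[i - 1] * dist[k - i - 1] for i in range(1, k))
    cut_in = sum(dist[k:])                 # sum of N_m over m > k
    return (-2 * alpha * n_bar * n_k - beta * (k - 1) * n_k
            + alpha * pasted_in + 2 * beta * cut_in)

alpha, beta, mass = 1.0, 1.0, 2.0
n_bar, dist = limiting_distribution(alpha, beta, mass, k_max=200)
total = sum(dist)
total_mass = sum(k * n for k, n in enumerate(dist, start=1))
worst = max(abs(stationary_residual(alpha, beta, n_bar, dist, k)) for k in range(1, 11))
print(abs(total - n_bar) < 1e-9, abs(total_mass - mass) < 1e-9, worst < 1e-9)
# → True True True
```

For $\alpha = \beta = 1$ and $M = 2$ the limiting values reduce to $\bar{N}_k = 2^{-k}$, so every molecular size keeps a positive limiting concentration.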



4 Definitions 

We suppose we have a finite H scheme $\sigma$ and an initial language $I$, defining
the splicing language $L = \sigma^*(I)$. Given two words $w, z$ in $L$ we write $w \to z$
to mean that there is some word $w'$ in $L$ (possibly the same as $w$ or $z$) so
that either $w, w'$ or $w', w$ splices, using a rule of $\sigma$, to produce $z$. Then $\to$ is
a binary relation on $L$. As usual we define the transitive closure $\to^+$ and the
reflexive transitive closure $\to^*$ of $\to$. Precisely, $w \to^+ z$ means there is some
finite sequence $w_0, w_1, \ldots, w_n$ of words of $L$ so that $n \geq 1$, $w_0 = w$, $w_n = z$, and
$w_k \to w_{k+1}$ for $0 \leq k < n$, and $w \to^* z$ means $w \to^+ z$ or $w = z$.

We say a word $w \in L$ is a first-order limit of $L$ iff for any $z$ in $L$ for which
$w \to^* z$ we have $z \to^* w$. A word which is not a first-order limit is called
transient in $L$; in other words, $w$ is transient in $L$ iff there is a word $z$ in $L$ so
that $w \to^* z$ but $z \to^* w$ is false. This notion of transience is meant to model
the following: The splicing operations $w \to^* z$ contribute some of the material
of the molecule $w$ to the molecule $z$, and this material is never reassembled in a
molecule of type $w$; hence the material of $w$ will eventually be used up.

We now define $L_1$ to be the set of first-order limit words of $L$, and we continue
recursively to define the set $L_k$ of $k$th-order limits of $L$ to be the set of first-order
limits of $L_{k-1}$. That is, we obtain $L_k$ from $L_{k-1}$ by deleting the words that are
transient in $L_{k-1}$.

Finally we define the limit language as

$$L_\infty = \bigcap_{k=1}^{\infty} L_k.$$

The limit languages described informally in the examples in Section 2 all 
satisfy this definition. Example 9 shows the difference between limits of order 1 
and order 2. 
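
When the splicing language is finite, the recursive definition can be carried out directly. The sketch below is an illustration only, with hypothetical function names: it builds the relation "a splicing involving w produces z" on the current set of words, removes every word w that reaches some z from which w cannot be reached back, and repeats until nothing changes. It is applied to the languages of Examples 1 and 9; for Example 9 the sixth word is taken to be abae, which is an assumption about the intended word.

```python
def splice(w1, w2, rule):
    """All words x1 u1 v2 y2 obtained from w1 = x1 u1 v1 y1 and w2 = x2 u2 v2 y2."""
    u1, v1, u2, v2 = rule
    site1, site2 = u1 + v1, u2 + v2
    results = set()
    for i in range(len(w1) - len(site1) + 1):
        if w1.startswith(site1, i):
            for j in range(len(w2) - len(site2) + 1):
                if w2.startswith(site2, j):
                    results.add(w1[: i + len(u1)] + w2[j + len(u2):])
    return results

def close_rules(rules):
    """Add the symmetric twin and both reflexive restrictions of every rule."""
    closed = set()
    for u1, v1, u2, v2 in rules:
        closed |= {(u1, v1, u2, v2), (u2, v2, u1, v1),
                   (u1, v1, u1, v1), (u2, v2, u2, v2)}
    return closed

def reachable(edges, start):
    """Words z with start ->* z (reflexive transitive closure)."""
    seen, stack = {start}, [start]
    while stack:
        for z in edges[stack.pop()]:
            if z not in seen:
                seen.add(z)
                stack.append(z)
    return seen

def limit_language(language, rules):
    """Iterate 'delete the transient words' on a finite, splicing-closed language."""
    current = set(language)
    while True:
        edges = {w: set() for w in current}
        for w1 in current:
            for w2 in current:
                for rule in rules:
                    for z in splice(w1, w2, rule):
                        if z in current:
                            edges[w1].add(z)
                            edges[w2].add(z)
        reach = {w: reachable(edges, w) for w in current}
        limits = {w for w in current if all(w in reach[z] for z in reach[w])}
        if limits == current:
            return current
        current = limits

ex1 = limit_language({"ab", "cd", "ad", "cb"}, close_rules([("a", "b", "c", "d")]))
print(sorted(ex1))
# → ['ad', 'cb']

rules9 = close_rules([("a", "b", "c", "d"), ("a", "d", "f", "b"),
                      ("a", "e", "a", "e"), ("a", "f", "a", "f")])
ex9 = limit_language({"cd", "ad", "afd", "afbae", "cbae", "abae"}, rules9)
print(sorted(ex9))
# → ['abae', 'afbae', 'afd', 'cbae']
```

In the second run the word cd is removed in the first pass and ad only in the second, which is the order 1 versus order 2 distinction mentioned above.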

We will often use the following interpretation of the relation $\to$. We consider
a directed graph $G_L$ in which the vertices are the words of $L$, so that there is
an edge from $w$ to $z$ if and only if $w \to z$. We call this the splicing graph of $L$,
although it is not determined just by $L$ but by the pair $(L, \sigma)$.

In this interpretation we can describe the limit language as follows: We start
by determining the strongly connected components of $G_L$; these are the maximal
subgraphs $G$ of the graph so that, for any vertices $w, z$ of $G$, we have $w \to^* z$.
Such a component is called a terminal component if there is no edge $w \to z$
with $w$ in $G$ and $z$ not in $G$. The first-order limit language $L_1$ consists of the
vertices which lie in the terminal components, and the transient words in $L$ are
the vertices of the non-terminal components.

We then define a subgraph $G_1$ of $G_L$ whose vertices are the vertices of the
terminal components, so that there is an edge from $w$ to $z$ if and only if there is
a splicing operation $(w, w') \Rightarrow z$ or $(w', w) \Rightarrow z$ using a rule of $\sigma$ in which $w'$
is in $L_1$. Then, of course, $L_\infty$ is the set of vertices of the intersection $G_\infty$ of the
chain of subgraphs constructed recursively by this process.



5 Regularity 

We continue with the terminology of Section 4. It is well-known that the splicing 
language L is regular ([1], [9]), and it is natural to ask whether the limit language 
$L_\infty$ is regular. We do not know the answer in general. However, we can give a
satisfactory answer in the important special case of a reflexive splicing system.
In Head’s original formulation of splicing [5] both symmetry and reflexivity were
understood, and there are good biochemical reasons to require that any H system
that purports to model actual chemical reactions must be both symmetric and
reflexive.

Theorem 1. Suppose $\sigma$ is a finite reflexive H scheme and $I$ is a finite initial
language. Then the limit language $L_\infty$ is regular.

Proof. First, let $S$ be the collection of all sites of rules in $\sigma$. That is, the string
$s$ is in $S$ if and only if there is a rule $(u_1, v_1; u_2, v_2)$ in $\sigma$ with either $s = u_1v_1$
or $s = u_2v_2$. For each $s \in S$ we let $L_s = L \cap A^*sA^*$; that is, $L_s$ consists of the
words of $L$ which contain $s$ as a factor. We shall refer to a set of the form $L_s$ as
a site class. We also define $L_I = L \setminus A^*SA^*$. This is the set of inert words of $L$;
that is, the set of words which do not contain any sites.

Suppose $x, y \in L_s$. By reflexivity we may find a rule $r = (u, v; u, v)$ of $\sigma$ so
that $s = uv$. Write $x = x_1uvx_2$ and $y = y_1uvy_2$. Then splicing $x$ and $y$ using $r$
produces $z = x_1uvy_2$, and splicing $y$ and $z$ using $r$ produces $y_1uvy_2$, or $y$. Hence
we have $x \to z \to y$, and $z$ is in $L_s$. We conclude that the subgraph of $G_L$
with the elements of $L_s$ as vertices is strongly connected,
and so it lies in one strongly connected component of $G_L$.

We conclude that each strongly connected component of $G_L$ is either a singleton
$\{w\}$ with $w \in L_I$ or a union of a collection of site classes. The first type
of component is clearly terminal. Thus the set $L_1$ of first-order limit words is
the union of $L_I$ and some collection of site classes. Moreover, any site class $L_s$
which appears in $L_1$ is still a strongly connected subset of the vertex set of $G_1$.
Hence we can use induction to extend this decomposition to see that each $L_k$ is
the union of $L_I$ and a collection of site classes.

Since the sets $L_{k+1} \subseteq L_k$ and there are only finitely many site classes we
see that the sequence $L_k$ is eventually constant, so $L_\infty = L_k$ for all sufficiently
large $k$. Hence $L_\infty$ is the union of $L_I$ and a collection of site classes. This is a
finite union of sets, each of which is regular (using the regularity of $L$), so $L_\infty$
is regular.

Corollary 1. If $\sigma$ is a finite reflexive H scheme and $I$ is not empty, then $L_\infty$
is not empty.

Proof. This is clear from the proof, since there are a finite number of strongly 
connected components (aside from the inert words) at each stage, and so there 
are terminal components at each stage; and the recursive construction stops after 
finitely many steps. 

It has been a difficult problem to determine the class of splicing languages as 
a subclass of the regular class, even if we restrict attention to reflexive splicing 
schemes; see [3]. Hence the following is rather unexpected. 

Theorem 2. Given any regular language K there is a finite reflexive and sym- 
metric splicing scheme a and a finite initial language I so that the limit language 
Leo LI . 

Proof. This is an easy consequence of a theorem of Head [6], which provides a 
finite reflexive and symmetric splicing scheme $\sigma_0$ and a finite initial language 
$I$ so that $\sigma_0^*(I) = cK$, where $c$ is some symbol not in the alphabet of $K$. In 
fact, every rule of $\sigma_0$ has the form $(cu, 1; cv, 1)$ for strings $u$ and $v$. 

Our only modification involves adding the rules $r_0 = (c, 1; 1, c)$, 
$\bar r_0 = (1, c; c, 1)$, $\dot r_0 = (c, 1; c, 1)$ and $\ddot r_0 = (1, c; 1, c)$ to $\sigma_0$ to 
define $\sigma$. This changes the splicing language to $L = c^*K$. To see this, note 
that if $j$ and $k$ are positive, then $c^j w$ and $c^k z$ splice, using any of the new 
rules, to produce $c^m z$, where, depending on the rule used, $m$ ranges over the 
integers from 0 to $j + k$. 

Now all words of $L \setminus K$ have the form $c^k w$ with $k > 0$, and these are 
transient since any such word splices with itself using $\bar r_0 = (1, c; c, 1)$ to 
produce the inert word $w \in K$. Hence the limit language is just $K$, the set of 
inert words. 
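
The splicing step used in this proof can be checked on small words. The following is a minimal sketch, assuming the usual convention that a rule $(u_1, u_2; u_3, u_4)$ turns $x_1 u_1 u_2 x_2$ and $y_1 u_3 u_4 y_2$ into $x_1 u_1 u_4 y_2$, with the empty string standing for 1. The function name `splice`, the rule constants and the sample words are illustrative and not taken from the text.

```python
def splice(x, y, rule):
    """Return all words obtained by splicing x with y using rule (u1, u2; u3, u4).

    Assumed convention: if x = x1 u1 u2 x2 and y = y1 u3 u4 y2,
    the product is x1 u1 u4 y2. The empty string "" plays the role of 1.
    """
    u1, u2, u3, u4 = rule
    left, right = u1 + u2, u3 + u4
    results = set()
    for i in range(len(x) - len(left) + 1):
        if not x.startswith(left, i):
            continue
        prefix = x[: i + len(u1)]            # x1 u1
        for j in range(len(y) - len(right) + 1):
            if y.startswith(right, j):
                suffix = y[j + len(u3):]     # u4 y2
                results.add(prefix + suffix)
    return results


# The four added rules (hypothetical constant names), with "" standing for 1.
R0 = ("c", "", "", "c")
R0_BAR = ("", "c", "c", "")
R0_DOT = ("c", "", "c", "")
R0_DDOT = ("", "c", "", "c")

# A word c^2 w with the hypothetical inert word w = "ab", spliced with itself.
self_products = splice("ccab", "ccab", R0_BAR)

# All products of c^2 x and c^1 y under the four rules.
all_products = set()
for rule in (R0, R0_BAR, R0_DOT, R0_DDOT):
    all_products |= splice("ccx", "cy", rule)
```

Here `self_products` contains the word "ab", which illustrates why every word $c^k w$ with $k > 0$ is transient, and `all_products` collects the words $c^m y$ for the various $m$.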
6 Conclusion 

We wrote this paper to introduce the notion of the limit language. This notion 
can be supported by qualitative reasoning, as in the examples in Section 2. 
However, the splicing operation is not a realistic model of the actual chemical 
reactions; for this we need something like cut and paste operations to model 
the separate chemical actions of restriction enzymes and ligases. Thus Examples 
6 and 7 give a better account of the molecules involved during the chemical 
reactions than Example 5. 

We believe that a complete understanding of the limit language must wait 
until a suitable dynamical model has been developed and supported by exper- 
iment. It is not hard to formulate a generalization of the example in Section 3 
to apply to arbitrary cut-and-paste systems, but it is much more difficult to 
solve such a generalization. We would hope eventually to derive regularity of the 
splicing language from the dynamical system, and then to define and analyze 
the limit language directly from the dynamical system. 

Our preliminary definition of the limit language in Section 4 is simply an 
attempt to circumvent the considerable difficulties of the dynamical systems 
approach and see what we might expect for the structure of the limit language. 
We cannot yet prove that this definition coincides with the natural definition 
based on dynamical systems, but we believe that it will, probably with minor 
restrictions on the splicing scheme. 
Abstract. Gene assembly in ciliates is a life process fascinating from 
both the biological and the computational points of view. Several formal 
models of this process have been formulated and investigated, among 
them a model based on (legal) strings and a model based on (overlap) 
graphs. The latter is more abstract because the translation of legal strings 
into overlap graphs is not injective. In this paper we consider and solve 
the overlap equivalence problem for realistic strings: when do two differ- 
ent realistic legal strings translate into the same overlap graph? Realistic 
legal strings are legal strings that “really” correspond to genes generated 
during the gene assembly process. 



1 Introduction 

Ciliates (ciliated protozoa) are a group of single-celled eukaryotic organisms. 
The group is ancient, about two billion years old, and very diverse: some 
8000 different species are currently known. A unique feature of ciliates is 
nuclear dualism: they have two kinds of functionally different nuclei in the same 
cell, a micronucleus and a macronucleus (each of them present in multiple 
copies); see, e.g., [15] for basic information on ciliates. The macronucleus is 
the “standard household nucleus” that provides RNA transcripts for producing 
proteins, while the micronucleus is a dormant nucleus that gets activated only 
during sexual reproduction. At some stage of sexual reproduction the genome 
of a micronucleus gets transformed into the genome of a macronucleus in the 
process called gene assembly. The process of gene assembly is fascinating from 
both the biological (see, e.g., [14], [15], [16], and [18]) and from the computa- 
tional (see, e.g., [4], [8], [10], [12], [13], [17]) points of view. A number of formal 
models of gene assembly were studied, including both the intermolecular models 
(see, e.g., [12], [13]) and the intramolecular models (see, e.g., [4] - [11], [17]). 
This paper continues research on intramolecular models. We consider here for- 
mal representations of genes generated during the process of gene assembly in 
two such models: one based on (legal) strings (see, e.g., [8]) and one based on 
(overlap) graphs (see, e.g., [4]). The latter model is more abstract than the for- 
mer because the translation of legal strings into overlap graphs is not injective. 
This fact leads naturally to the following overlap equivalence problem for legal 
strings: when do two different legal strings yield the same overlap graph? 

In this paper we consider the overlap equivalence problem and solve it for 
realistic legal strings (i.e., those legal strings that “really” correspond to genes 
generated during the process of gene assembly, see, e.g., [4]). 

2 Preliminaries 

For positive integers $k$ and $n$, we denote by $[k, n] = \{k, k+1, \dots, n\}$ the 
interval of integers between $k$ and $n$. 

Let $\Sigma = \{a_1, a_2, \dots\}$ be an alphabet. The set of all (finite) strings over 
$\Sigma$ is denoted by $\Sigma^*$; it includes the empty string, denoted by $\Lambda$. For 
strings $u, v \in \Sigma^*$, we say that $v$ is a substring of $u$ if $u = w_1 v w_2$ for some 
$w_1, w_2 \in \Sigma^*$. Also, $v$ is a conjugate of $u$ if $u = w_1 w_2$ and $v = w_2 w_1$ for 
some strings $w_1$ and $w_2$. 

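For an alphabet $\Sigma$, let $\overline{\Sigma} = \{\bar a \mid a \in \Sigma\}$ be a signed copy of $\Sigma$ 
(disjoint from $\Sigma$). Then the total alphabet $\Sigma \cup \overline{\Sigma}$ is called a signed 
alphabet. The set of all strings over $\Sigma \cup \overline{\Sigma}$ is denoted by $\Sigma^{\maltese}$, that is, 

$\Sigma^{\maltese} = (\Sigma \cup \overline{\Sigma})^*$. 

A string $v \in \Sigma^{\maltese}$ is called a signed string over $\Sigma$. We adopt the convention 
that $\bar{\bar a} = a$ for each $a \in \Sigma$. 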

Let $v \in \Sigma^{\maltese}$ be a signed string over $\Sigma$. We say that a letter 
$a \in \Sigma \cup \overline{\Sigma}$ occurs in $v$ if either $a$ or $\bar a$ is a substring of $v$. Let 
$\mathrm{dom}(v) = \{a \mid a \in \Sigma \text{ and } a \text{ occurs in } v\}$ be the domain of $v$. Thus the 
domain of a signed string consists of the unsigned letters that occur in it. 

Example 1. Let $\Sigma = \{a, b\}$, and thus $\overline{\Sigma} = \{\bar a, \bar b\}$. The letters $a$ and $b$ 
occur in the signed string $v = a \bar b a a$, although $b$ is not a substring of $v$ (its 
signed copy $\bar b$ is a substring of $v$). Also, $\mathrm{dom}(v) = \{a, b\}$. □ 

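The signing $a \mapsto \bar a$ can be extended to longer strings as follows. For a 
nonempty signed string $u = a_1 a_2 \dots a_n \in \Sigma^{\maltese}$, where $a_i \in \Sigma \cup \overline{\Sigma}$ for 
each $i$, the inversion $\bar u$ of $u$ is defined by 

$\bar u = \bar a_n \bar a_{n-1} \dots \bar a_1 \in \Sigma^{\maltese}$. 

Also, let $u^R = a_n a_{n-1} \dots a_1$ and $u^C = \bar a_1 \bar a_2 \dots \bar a_n$ be the reversal and 
the complement of $u$, respectively. It is then clear that 
$\bar u = (u^R)^C = (u^C)^R$. 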

Let $\Sigma$ and $\Gamma$ be two alphabets. A function $\varphi\colon \Sigma^* \to \Gamma^*$ is a morphism if 
$\varphi(uv) = \varphi(u)\varphi(v)$ for all strings $u, v \in \Sigma^*$. A morphism 
$\varphi\colon \Sigma^{\maltese} \to \Gamma^{\maltese}$ between signed strings is required to satisfy 
$\varphi(\bar u) = \overline{\varphi(u)}$ for all $u \in \Sigma^{\maltese}$. Note that, for a morphism $\varphi$, the 
images $\varphi(a)$ of the letters $a \in \Sigma$ determine $\varphi$, i.e., if the images of the 
letters $a \in \Sigma$ are given, then the image of every string over $\Sigma$ is 
determined. 

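Example 2. Let $\Sigma = \{a, b\}$ and $\Gamma = \{0, 1, 2\}$, and let $\varphi\colon \Sigma^{\maltese} \to \Gamma^{\maltese}$ be the 
morphism defined by $\varphi(a) = 0$ and $\varphi(b) = 2$. Then for $u = abb$, we have 
$\varphi(u) = 022$ and $\varphi(\bar u) = \varphi(\bar b \bar b \bar a) = \bar 2 \bar 2 \bar 0 = \overline{\varphi(u)}$. □ 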

Signed strings $u \in \Sigma^{\maltese}$ and $v \in \Gamma^{\maltese}$ are isomorphic if there exists an 
injective morphism $\varphi\colon \Sigma^{\maltese} \to \Gamma^{\maltese}$ such that $\varphi(\Sigma) \subseteq \Gamma$ and $\varphi(u) = v$. 
Hence two strings are isomorphic if each of the strings can be obtained from the 
other by renaming letters. 

A signed string $v = a_1 a_2 \dots a_n$ is a signed permutation of another signed 
string $u = b_1 b_2 \dots b_n$ if there exists a permutation $i_1, i_2, \dots, i_n$ of $[1, n]$ 
such that $a_j \in \{b_{i_j}, \bar b_{i_j}\}$ for each $j \in [1, n]$. 

Example 3. Let $u = aabcb \in \{a, b, c\}^{\maltese}$. Then $u$ is a signed permutation of 
$acabb$, as well as of $ababc$ (and many other signed strings). Also, $u$ is 
isomorphic to the signed string $v = aacbc$. The isomorphism $\varphi$ in question is 
defined by: $\varphi(a) = a$, $\varphi(c) = b$, and $\varphi(b) = c$. Hence also $\varphi(\bar a) = \bar a$, 
$\varphi(\bar c) = \bar b$, and $\varphi(\bar b) = \bar c$. □ 
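
The operations of this section are easy to experiment with. The following is a minimal sketch, assuming an illustrative encoding that is not part of the text: a signed letter is a pair (symbol, barred), and a trailing `~` in the input marks a barred letter. The sample string is likewise only an illustration.

```python
from typing import List, Set, Tuple

Letter = Tuple[str, bool]  # (symbol, barred?) -- illustrative encoding


def parse(text: str) -> List[Letter]:
    """Parse e.g. "a b~ a" where a trailing "~" marks a barred letter."""
    return [(tok.rstrip("~"), tok.endswith("~")) for tok in text.split()]


def bar(x: Letter) -> Letter:
    """Toggle the sign of a single letter."""
    return (x[0], not x[1])


def complement(u: List[Letter]) -> List[Letter]:
    """Bar every letter, keeping the order."""
    return [bar(x) for x in u]


def reversal(u: List[Letter]) -> List[Letter]:
    """Reverse the order, keeping the signs."""
    return list(reversed(u))


def inversion(u: List[Letter]) -> List[Letter]:
    """Reverse the order and bar every letter."""
    return [bar(x) for x in reversed(u)]


def dom(u: List[Letter]) -> Set[str]:
    """Unsigned letters occurring in u."""
    return {sym for sym, _ in u}


v = parse("a b~ a a")
```

With this encoding the inversion coincides with the complement of the reversal and with the reversal of the complement, and `dom(v)` contains `b` even though the unbarred letter `b` is not a substring of `v`.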

3 MDS Arrangements and Legal Strings 

The structural information about the micronuclear or an intermediate precursor 
of a macronuclear gene can be given by the sequence of MDSs forming the 
gene. The whole assembly process can then be thought of as a process of 
assembling MDSs through splicing, to obtain the MDSs in the orthodox order 
$M_1, M_2, \dots, M_\kappa$. Thus one can represent a macronuclear gene, as well as its 
micronuclear or an intermediate precursor, by its sequence of MDSs only. 

Example 4. The actin I gene of Sterkiella nova has the following MDS/IES 
micronuclear structure: 

$M_3 I_1 M_4 I_2 M_6 I_3 M_5 I_4 M_7 I_5 M_9 I_6 \overline{M}_2 I_7 M_1 I_8 M_8$    (1) 

(see Fig. 1(a)). The ‘micronuclear pattern’ is obtained by removing the IESs $I_i$. 
For our purposes, the so obtained string 

$\alpha = M_3 M_4 M_6 M_5 M_7 M_9 \overline{M}_2 M_1 M_8$ 

has the same information as the structure given in the representation (1). 

The macronuclear version of this gene is given in Fig. 1(b). There the IESs 
have been excised and the MDSs have been spliced (by overlapping of their ends) 
in the orthodox order $M_1, M_2, \dots, M_9$. □ 

Fig. 1. (a) The micronuclear version of the actin I gene in Sterkiella nova. (b) 
The macronuclear version of the actin I gene of Sterkiella nova. The vertical 
lines describe the positions where the MDSs have been spliced together (by 
overlapping of their ends) 



For each $\kappa \ge 1$, let 

$\Theta_\kappa = \{M_i \mid 1 \le i \le \kappa\}$ 

be an alphabet representing elementary MDSs, i.e., each $M_i$ denotes an MDS 
that is present in the micronucleus. The signed strings in $\Theta_\kappa^{\maltese}$ are MDS 
arrangements (of size $\kappa$). An MDS arrangement $\alpha \in \Theta_\kappa^{\maltese}$ is orthodox if it is of 
the form $M_1 M_2 \dots M_\kappa$. Note that an orthodox MDS arrangement does not contain 
any inversions of MDSs, and the MDSs are in their orthodox order. A signed 
permutation of an orthodox MDS arrangement $\alpha$ is a realistic MDS arrangement. 

A nonempty signed string $v \in \Sigma^{\maltese}$ is a legal string over $\Sigma$ if every letter 
$a \in \mathrm{dom}(v)$ occurs exactly twice in $v$. (Recall that an occurrence of a letter can 
be signed.) 

A letter $a$ is positive in a legal string $v$ if both $a$ and $\bar a$ are substrings of 
$v$; otherwise, $a$ is negative in $v$. 

Let $\Delta = \{2, 3, \dots\}$ be a designated alphabet of pointers. 

Example 5. Let $v = 2\,4\,3\,\bar 2\,5\,3\,4\,\bar 5$ be a legal string over $\Delta$. Pointers 2 and 5 
are positive in $v$, while 3 and 4 are negative in $v$. On the other hand, the string 
$w = 2\,4\,3\,\bar 2\,5\,3\,\bar 5$ is not legal, since 4 has only one occurrence in $w$. □ 
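
Legality and the sign of a pointer are simple counting conditions. The following is a minimal sketch, assuming an illustrative encoding in which a pointer $p$ is the integer `p` and its barred copy is `-p`. The placement of the bars in the sample strings is an assumption made for illustration; only the facts that 2 and 5 are positive and 3 and 4 are negative come from the example.

```python
from collections import Counter


def is_legal(v):
    """A nonempty signed string is legal if every pointer occurs exactly twice."""
    counts = Counter(abs(p) for p in v)
    return len(v) > 0 and all(c == 2 for c in counts.values())


def sign_of(v, p):
    """Return '+' if pointer p occurs both barred and unbarred in v, else '-'."""
    return "+" if (p in v and -p in v) else "-"


# Assumed bar placement: the second occurrences of 2 and 5 are barred.
v = [2, 4, 3, -2, 5, 3, 4, -5]
w = [2, 4, 3, -2, 5, 3, -5]

signs = {p: sign_of(v, p) for p in (2, 3, 4, 5)}
```

Here `is_legal(v)` holds while `is_legal(w)` fails, because the pointer 4 occurs only once in `w`.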

We shall now represent the MDS arrangements by legal strings over the 
pointer alphabet $\Delta$. In this way legal strings become a formalism for 
describing the sequences of pointers present in the micronuclear and the 
intermediate molecules. 

Let $\varrho_\kappa\colon \Theta_\kappa^{\maltese} \to \Delta^{\maltese}$ be the morphism defined by: 

$\varrho_\kappa(M_1) = 2$, $\varrho_\kappa(M_\kappa) = \kappa$, $\varrho_\kappa(M_i) = i(i+1)$ for $2 \le i < \kappa$, 
and $\varrho_\kappa(\overline{M}_i) = \overline{\varrho_\kappa(M_i)}$ for $1 \le i \le \kappa$. 

We say then that a legal string $u$ is realistic if there exists a realistic MDS 
arrangement $\alpha$ such that $u = \varrho_\kappa(\alpha)$. 
Example 6. Consider the micronuclear MDS arrangement of the actin I gene of 
Sterkiella nova: $\alpha = M_3 M_4 M_6 M_5 M_7 M_9 \overline{M}_2 M_1 M_8$. We have 

$\varrho_9(\alpha) = 3\,4\;\,4\,5\;\,6\,7\;\,5\,6\;\,7\,8\;\,9\;\,\bar 3\,\bar 2\;\,2\;\,8\,9$. 

Since $\alpha$ is a realistic MDS arrangement, $\varrho_9(\alpha)$ is a realistic legal string. □ 

The following example shows that there exist legal strings that are not real- 
istic. 

Example 7. The string $u = 2\,3\,4\,3\,2\,4$ is legal, but it is not realistic, since it 
has no ‘realistic parsing’. Indeed, suppose that there exists a realistic MDS 
arrangement $\alpha$ such that $u = \varrho_4(\alpha)$. Clearly, $\alpha$ must end with the MDS $M_4$, 
since $\varrho_4(M) \ne 2\,4$ for all MDSs $M$. Similarly, since $\varrho_4(M) \ne 3\,2$ for all MDSs 
$M$, $\alpha$ must end with $M_1 M_4$. Now, however, $u = 2\,3\,4\,3\,\varrho_4(M_1 M_4)$ gives a 
contradiction, since $3$ and $4\,3$ are not images of any MDSs $M$. □ 
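
The morphism from MDS arrangements to legal strings, and the notion of a realistic string, can be sketched in code. This is an illustration under stated assumptions: an MDS $M_i$ is encoded as the integer `i`, an inverted MDS as `-i`, a barred pointer as a negative integer, and the string of Example 7 is taken without bars, as it appears in this text. The function names are hypothetical, and the realism test is a brute-force search that is only practical for small sizes.

```python
from itertools import permutations, product


def rho(kappa, arrangement):
    """Map an MDS arrangement (signed ints) to its string of pointers."""
    out = []
    for m in arrangement:
        i = abs(m)
        if i == 1:
            image = [2]
        elif i == kappa:
            image = [kappa]
        else:
            image = [i, i + 1]
        if m < 0:
            # Image of an inverted MDS: reverse the image and bar every pointer.
            image = [-p for p in reversed(image)]
        out.extend(image)
    return out


def is_realistic(u, kappa):
    """Brute force: try every signed permutation of M_1 ... M_kappa."""
    for order in permutations(range(1, kappa + 1)):
        for signs in product((1, -1), repeat=kappa):
            candidate = [s * i for s, i in zip(signs, order)]
            if rho(kappa, candidate) == list(u):
                return True
    return False


# The actin I arrangement of Example 6, with M_2 inverted.
actin = [3, 4, 6, 5, 7, 9, -2, 1, 8]
actin_pointers = rho(9, actin)

# The string of Example 7 (assumed to carry no bars).
not_realistic = is_realistic([2, 3, 4, 3, 2, 4], 4)
```

The search for Example 7 inspects all 384 signed permutations of four MDSs and finds no parsing, which agrees with the argument given there.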

4 Overlap Graphs of Legal Strings 

Let $u = a_1 a_2 \dots a_n \in \Sigma^{\maltese}$ be a legal string over $\Sigma$, where $a_i \in \Sigma \cup \overline{\Sigma}$ for 
each $i$. For each letter $a \in \mathrm{dom}(u)$, let $i$ and $j$ with $1 \le i < j \le n$ be the 
indices such that $a_i, a_j \in \{a, \bar a\}$. Then the substring 

$u_{(a)} = a_i a_{i+1} \dots a_j$ 

is the $a$-interval of $u$. Two different letters $a, b \in \Sigma$ are said to overlap in 
$u$ if the $a$-interval and the $b$-interval of $u$ overlap: if $u_{(a)} = a_{i_1} \dots a_{j_1}$ and 
$u_{(b)} = a_{i_2} \dots a_{j_2}$, then either $i_1 < i_2 < j_1 < j_2$ or $i_2 < i_1 < j_2 < j_1$. 

Example 8. Let $u = 2\,4\,3\,5\,3\,2\,6\,5\,7\,4\,6\,7$ be a string of pointers. The 
2-interval of $u$ is the substring $u_{(2)} = 2\,4\,3\,5\,3\,2$, which contains only one 
occurrence of pointer 4 and pointer 5, but either two or no occurrences of 3, 6 
and 7. Hence pointer 2 overlaps with 4 and 5, but not with 3, 6 and 7. □ 
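
The overlap test is a comparison of four indices. The following is a minimal sketch, assuming pointers are encoded as integers (a negative integer for a barred pointer) and positions are counted from 0; the function names are illustrative.

```python
def interval(u, a):
    """Return (i, j), the 0-based positions of the two occurrences of pointer a."""
    i, j = [k for k, x in enumerate(u) if abs(x) == a]
    return i, j


def overlap(u, a, b):
    """True if the a-interval and the b-interval of u overlap."""
    i1, j1 = interval(u, a)
    i2, j2 = interval(u, b)
    return i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1


u = [2, 4, 3, 5, 3, 2, 6, 5, 7, 4, 6, 7]
i, j = interval(u, 2)
two_interval = u[i : j + 1]
partners = [b for b in (3, 4, 5, 6, 7) if overlap(u, 2, b)]
```

For the string of Example 8, `partners` lists exactly the pointers that occur once inside the 2-interval.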

For a finite set $V$, let 

$E(V) = \{\{x, y\} \mid x, y \in V,\ x \ne y\}$ 

be the set of all unordered pairs of different elements of $V$. A graph is a pair 
$(V, E)$, where $V$ is a finite set of vertices and $E \subseteq E(V)$ is a set of edges. A 
signed graph $\gamma = (V, E, \sigma)$ consists of a graph $(V, E)$ together with a labelling 
$\sigma\colon V \to \{-, +\}$ of the vertices. Vertices labelled by $+$ are positive, and those 
labelled by $-$ are negative. We use $x^+$ and $x^-$ to indicate that $\sigma(x) = +$ and 
$\sigma(x) = -$, respectively. 

We shall use signed graphs to represent the structure of overlaps of pointers 
occurring in a legal string as follows. 

For a pointer $p \in \Delta$, let $\mathbf{p} = \{p, \bar p\}$. Let $v \in \Delta^{\maltese}$ be a legal string of 
pointers, and let 

$P_v = \{\mathbf{p} \mid p \text{ occurs in } v\}$. 

Then the overlap graph of $v$ is the signed graph $\gamma_v = (P_v, E, \sigma)$ such that 

$\sigma(\mathbf{p}) = +$ if $p \in \Delta$ is positive in $v$, 
$\sigma(\mathbf{p}) = -$ if $p \in \Delta$ is negative in $v$, 

and 

$\{\mathbf{p}, \mathbf{q}\} \in E \iff p \text{ and } q \text{ overlap in } v$. 



Example 9. Consider the legal string $v = 3\,4\,5\,2\,\bar 3\,\bar 2\,4\,\bar 5 \in \Delta^{\maltese}$. Then its 
overlap graph is given in Fig. 2. Indeed, pointers 2, 3 and 5 are positive in $v$ 
(and hence the vertices 2, 3 and 5 have sign $+$ in the overlap graph $\gamma_v$), while 
pointer 4 is negative in $v$ (and the vertex 4 has sign $-$ in $\gamma_v$). The intervals of 
the pointers in $v$ are 

$v_{(2)} = 2\,\bar 3\,\bar 2$, $v_{(3)} = 3\,4\,5\,2\,\bar 3$, 

$v_{(4)} = 4\,5\,2\,\bar 3\,\bar 2\,4$, $v_{(5)} = 5\,2\,\bar 3\,\bar 2\,4\,\bar 5$. 

The overlappings of these intervals yield the edges of the overlap graph. □ 
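
The construction of the signed overlap graph can be sketched as follows. This is an illustration only: pointers are encoded as integers with negative values for barred occurrences, the graph is returned as a sign table together with a set of edges, and the placement of the bars in the sample string is an assumption chosen so that 2, 3 and 5 are positive and 4 is negative, as the example states.

```python
from itertools import combinations


def overlap_graph(v):
    """Return (signs, edges) of the overlap graph of a legal string v."""
    positions = {}
    for k, x in enumerate(v):
        positions.setdefault(abs(x), []).append(k)
    # A pointer is positive if it occurs both barred and unbarred.
    signs = {p: "+" if (p in v and -p in v) else "-" for p in positions}
    edges = set()
    for p, q in combinations(sorted(positions), 2):
        (i1, j1), (i2, j2) = positions[p], positions[q]
        if i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1:
            edges.add(frozenset((p, q)))
    return signs, edges


# Assumed bar placement for the string of Example 9.
v = [3, 4, 5, 2, -3, -2, 4, -5]
signs, edges = overlap_graph(v)
```

For this string the graph has four edges: 3 is adjacent to 2, 4 and 5, and 4 is adjacent to 5.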




Fig. 2. The overlap graph of the signed string v = 34523245. An edge is 
present between the signed vertices p and q if and only if p and q overlap in v 



Remark 1. Overlap graphs of unsigned legal strings are also known as circle 
graphs (see [1,2,3]). These graphs have a geometric interpretation as chord 
diagrams, where a chord diagram $(C, V)$ consists of a circle $C$ in the plane 
together with a set $V$ of chords. A graph $\gamma = (V, E)$ is a circle graph if there 
exists a chord diagram $(C, V)$ such that $\{x, y\} \in E$ for different $x$ and $y$ if and 
only if the chords $x$ and $y$ intersect each other. □ 
5 The Overlap Equivalence Problem 

The mapping $w \mapsto \gamma_w$ of legal strings to overlap graphs is not injective. 
Indeed, for each legal string $w = w_1 w_2$, we have 

$\gamma_{w_1 w_2} = \gamma_{w_2 w_1}$ and $\gamma_w = \gamma_{c(w)}$, 

where $c$ is any morphism that selects one element $c(p)$ from $\mathbf{p} = \{p, \bar p\}$ for 
each $p$ (and then obviously $c(\bar p) = \overline{c(p)}$). In particular, all conjugates of a 
legal string $w$ have the same overlap graph. Also, the reversal $w^R$ and the 
complementation $w^C$ of a legal string $w$ define the same overlap graph as $w$ 
does. 

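Example 10. The following eight legal strings of pointers (in $\Delta^{\maltese}$) have the 
same overlap graph: $2323$, $3232$, $\bar 2 3 \bar 2 3$, $3 \bar 2 3 \bar 2$, $2 \bar 3 2 \bar 3$, $\bar 3 2 \bar 3 2$, 
$\bar 2 \bar 3 \bar 2 \bar 3$, $\bar 3 \bar 2 \bar 3 \bar 2$. □ 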

Example 11. Not all string representations of an overlap graph $\gamma$ need be 
obtained from a fixed representation string $w$ by using the operations of 
conjugation, inversion and reversing (and by considering isomorphic strings 
equal). For instance, the strings $v_1 = 23342554$ and $v_2 = 35242453$ define the 
same overlap graph (see Fig. 3), while $v_1$ is not isomorphic to any signed string 
obtained from $v_2$ by the above operations. However, we will demonstrate that if 
strings $u$ and $v$ are realistic, then $\gamma_u = \gamma_v$ implies that $u$ and $v$ can be obtained 
from each other by using the operations of conjugation, inversion and reversing. □ 
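
The claim of Example 11 can be checked mechanically. The following is a sketch under stated assumptions: the two strings are taken without bars, exactly as they appear in this text, pointers are encoded as integers with negative values for barred occurrences, and all function names are illustrative. Isomorphism is handled by trying every renaming of the pointers.

```python
from itertools import combinations, permutations


def overlap_graph(v):
    """Return (signs, edges) of the overlap graph of a legal string v."""
    positions = {}
    for k, x in enumerate(v):
        positions.setdefault(abs(x), []).append(k)
    signs = {p: "+" if (p in v and -p in v) else "-" for p in positions}
    edges = set()
    for p, q in combinations(sorted(positions), 2):
        (i1, j1), (i2, j2) = positions[p], positions[q]
        if i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1:
            edges.add(frozenset((p, q)))
    return signs, edges


def equivalent(u, v):
    """True if u arises from a conjugate of v by reversal and/or complementation."""
    u = tuple(u)
    for i in range(len(v)):
        c = tuple(v[i:]) + tuple(v[:i])      # a conjugate of v
        r = c[::-1]                          # its reversal
        if u in (c, r, tuple(-x for x in c), tuple(-x for x in r)):
            return True
    return False


def renamings(v):
    """Yield all strings obtained from v by renaming its pointers."""
    letters = sorted({abs(x) for x in v})
    for image in permutations(letters):
        table = dict(zip(letters, image))
        yield [table[abs(x)] * (1 if x > 0 else -1) for x in v]


v1 = [2, 3, 3, 4, 2, 5, 5, 4]
v2 = [3, 5, 2, 4, 2, 4, 5, 3]
same_graph = overlap_graph(v1) == overlap_graph(v2)
related = any(equivalent(v1, w) for w in renamings(v2))
```

Both strings yield a graph whose only edge joins 2 and 4, yet no renaming of $v_2$ is related to $v_1$: read cyclically, $v_1$ has two pairs of adjacent equal pointers while $v_2$ has only one, and this count is preserved by all of the operations.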




Fig. 3. The common overlap graph of legal strings $v_1 = 23342554$ and $v_2 = 
35242453$. The signed strings $v_1$ and $v_2$ are not obtainable from each other by the 
operations of conjugation, reversal and complementation 



Examples 10 and 11 lead naturally to the overlap equivalence problem for 
realistic legal strings: when do two realistic legal strings $u$ and $v$ have the same 
overlap graph, i.e., when is $\gamma_u = \gamma_v$? To solve this problem, we begin by 
characterizing legal substrings of realistic legal strings. 

For a signed string $v \in \Delta^{\maltese}$, let 

$v_{\min} = \min(\mathrm{dom}(v))$ and $v_{\max} = \max(\mathrm{dom}(v))$. 

Thus, $\mathrm{dom}(v) \subseteq [v_{\min}, v_{\max}]$. 
Example 12. The string $v = 4\,8\,3\,4\,8\,3$ is legal. We have $\mathrm{dom}(v) = \{3, 4, 8\}$, and 
hence $v_{\min} = 3$ and $v_{\max} = 8$. □ 



Lemma 1. Let $u$ be a realistic legal string with $\mathrm{dom}(u) = [2, \kappa]$, and let $v$ be a 
legal substring of $u$. Then either $\mathrm{dom}(v) = [p, q]$ or $\mathrm{dom}(v) = [2, \kappa] \setminus [p, q]$, for 
some $p$ and $q$ with $p \le q$. 

Proof. Let $u = u_1 v u_2$; without loss of generality we can assume that $u_1$ and 
$u_2$ are not both empty strings. 

(a) Assume that there exists a pointer $p$ with $v_{\min} < p \le u_{\max}$ such that 
$p \notin \mathrm{dom}(v)$ and $p - 1 \in \mathrm{dom}(v)$. Then $p > 2$. Let $w = (p-1)p$ $(= \varrho_\kappa(M_{p-1}))$. 
Since $u$ contains $w$ or $\bar w$, it follows that either 

$u = u_{11}\, \bar p\, \overline{(p-1)}\, v_1 u_2$, where $v = \overline{(p-1)}\, v_1$, or 
$u = u_1 v_1 (p-1)\, p\, u_{21}$, where $v = v_1 (p-1)$. 

(b) Similarly, if there exists $q$ with $u_{\min} \le q < v_{\max}$ such that $q \notin \mathrm{dom}(v)$ 
and $q + 1 \in \mathrm{dom}(v)$, then either 

$u = u_{12}\, q (q+1)\, v_2 u_2$, where $v = (q+1) v_2$, or 
$u = u_1 v_2\, \overline{(q+1)}\, \bar q\, u_{22}$, where $v = v_2\, \overline{(q+1)}$. 

The following three cases can hold now and each of them yields the conclusion 
from the statement of the lemma. 

1. There are no pointers $p$ as in (a). In this case, $\mathrm{dom}(v) = [q+1, \kappa]$, where 
$q \ge 2$ is a unique pointer for which (b) holds. 

2. There are no pointers $q$ as in (b). In this case, $\mathrm{dom}(v) = [2, p-1]$, where 
$p \le \kappa$ is a unique pointer for which (a) holds. 

3. There exists a pointer $p$ as in (a) and a pointer $q$ as in (b). By the above, $p$ 
is unique for (a) and $q$ is unique for (b). Hence either $\mathrm{dom}(v) = [q+1, p-1]$, 
if $p - 1 \ge q + 1$, or $\mathrm{dom}(v) = [2, p-1] \cup [q+1, \kappa]$, if $p - 1 < q + 1$. 



□ 



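Example 13. Let 

$u = \bar 7\,\bar 6\,7\,8\,2\,\bar 9\,\bar 8\,\bar 3\,\bar 2\,9\,3\,4\,\bar 5\,\bar 4\,5\,6$. 

Then $u$ is realistic, since $u = \varrho_9(\overline{M}_6 M_7 M_1 \overline{M}_8 \overline{M}_2 M_9 M_3 \overline{M}_4 M_5)$. The string $u$ 
has the following legal substrings (in addition to $u$ and $\Lambda$): $v_1 = 4\,\bar 5\,\bar 4\,5$ with 
$\mathrm{dom}(v_1) = [4, 5]$, $v_2 = 8\,2\,\bar 9\,\bar 8\,\bar 3\,\bar 2\,9\,3$ with $\mathrm{dom}(v_2) = [2, 3] \cup [8, 9]$, and 
$v_3 = 8\,2\,\bar 9\,\bar 8\,\bar 3\,\bar 2\,9\,3\,4\,\bar 5\,\bar 4\,5$ with $\mathrm{dom}(v_3) = [2, 5] \cup [8, 9]$. □ 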
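
The legal substrings of Example 13 and the shape of their domains predicted by Lemma 1 can be enumerated directly. The following is a minimal sketch, assuming pointers are encoded as integers with negative values for barred occurrences. The bars used for $u$ are those forced by parsing it into images of MDSs, with $M_1$ and $M_9$ assumed to be uninverted; the domains do not depend on the bars in any case. The function names are illustrative.

```python
from collections import Counter


def legal_substrings(u):
    """All nonempty proper substrings of u in which every pointer occurs twice."""
    n = len(u)
    found = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            if (i, j) == (0, n):
                continue  # skip u itself
            piece = u[i:j]
            counts = Counter(abs(p) for p in piece)
            if all(c == 2 for c in counts.values()):
                found.append(piece)
    return found


def is_interval(s):
    """True if the set of integers s is of the form [p, q]."""
    return bool(s) and s == set(range(min(s), max(s) + 1))


def lemma1_shape(domain, kappa):
    """Domain is an interval, or the complement of an interval in [2, kappa]."""
    full = set(range(2, kappa + 1))
    return is_interval(domain) or is_interval(full - domain)


u = [-7, -6, 7, 8, 2, -9, -8, -3, -2, 9, 3, 4, -5, -4, 5, 6]
pieces = legal_substrings(u)
domains = [{abs(p) for p in piece} for piece in pieces]
```

The search returns exactly three substrings, in the order $v_2$, $v_3$, $v_1$, and each domain has one of the two shapes allowed by Lemma 1.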

The following result provides more details of the structure of legal substrings 
of realistic legal strings. 
Lemma 2. Let $u$ be a realistic string with $\mathrm{dom}(u) = [2, \kappa]$, and let $v$ be a legal 
substring of $u$. 

(i) If $2 \notin \mathrm{dom}(v)$, then either $v = v_{\min} v'$ or $v = v' v_{\min}$. 
(ii) If $\kappa \notin \mathrm{dom}(v)$, then either $v = v_{\max} v'$ or $v = v' v_{\max}$. 

(iii) If $2, \kappa \in \mathrm{dom}(v)$, then $\mathrm{dom}(v) = [2, p] \cup [q, \kappa]$ and either $v = q v' p$ or 
$v = p v' q$. 

Proof. Assume that $v \ne u$, and suppose first that $2 \notin \mathrm{dom}(v)$, and let $p = v_{\min}$. 
By Lemma 1, $\mathrm{dom}(v) = [p, q]$ for some pointer $q$ with $p \le q \le \kappa$. Now since 
$p - 1 \notin \mathrm{dom}(v)$ and either $(p-1)p$ or its inversion $\bar p\,\overline{(p-1)}$ is a substring of 
$u$, it must be that either $v$ begins or $v$ ends with an occurrence of $p$. But this is 
possible only when $v = p v'$ or $v = v' p$, as required. 

The claim for the case $\kappa \notin \mathrm{dom}(v)$ is similar to the above case. 

If $2, \kappa \in \mathrm{dom}(v)$, then $\mathrm{dom}(v) = [2, p] \cup [q, \kappa]$ for some $p$ and $q$ with $p < q$ 
by Lemma 1. Since either $p(p+1)$ or its inversion $\overline{(p+1)}\,\bar p$ is a substring of 
$u$, either $v = v_1 p$ or $v = p v_1$. Similarly, either $v = q v_2$ or $v = v_2 q$, and so the 
last claim from the statement of the lemma holds. □ 

The following theorem gives a characterization for those pairs of realistic 
legal strings that yield the same overlap graph. 

Recall that for a string $v = p_1 p_2 \dots p_n \in \Delta^{\maltese}$, the reversal of $v$ is defined 
by $v^R = p_n p_{n-1} \dots p_1$, and the complementation by $v^C = \bar p_1 \bar p_2 \dots \bar p_n$. For two 
signed strings $u, v \in \Delta^{\maltese}$, let $u \approx v$ if $u$ is obtained from a conjugate of $v$ by a 
composition of the operations of reversal and complementation. Thus, $u \approx v$ if 
and only if there exist strings $v_1$ and $v_2$ such that $v = v_1 v_2$ and $u$ is one of the 
strings $v_2 v_1$, $(v_2 v_1)^R$, $(v_2 v_1)^C$, or $((v_2 v_1)^R)^C$. 

Theorem 1. Let $u$ and $v$ be two realistic legal strings such that $\mathrm{dom}(u) = 
[2, \kappa] = \mathrm{dom}(v)$. Then $\gamma_u = \gamma_v$ if and only if $u \approx v$. 

Proof. Assume that $\gamma_u = \gamma_v$. We prove the theorem by induction on $\kappa$. If 
$\kappa = 2$, then clearly $u \approx v$. Suppose then that the claim holds for strings over 
$[2, \kappa - 1]$. 

Since we now consider strings up to equivalence with respect to conjugation, 
reversal and complementation, we can assume without loss of generality that 
both $u$ and $v$ end with $\varrho_\kappa(M_{\kappa-1}) = (\kappa - 1)\kappa$. Now 

$u = u_1 \kappa' u_2 (\kappa - 1)\kappa$ and $v = v_1 \kappa' v_2 (\kappa - 1)\kappa$,    (2) 

where $\kappa' = \kappa$ or $\kappa' = \bar\kappa$. The strings $u' = u_1 u_2 (\kappa - 1)$ and $v' = v_1 v_2 (\kappa - 1)$, 
that are obtained from $u$ and $v$ by erasing the occurrences of $\kappa$, are realistic, and 
since $\gamma_{u'} = \gamma_{v'}$, we have $u' \approx v'$ by the induction hypothesis. Since the sign of 
the last occurrence of $(\kappa - 1)$ is the same in $u'$ and $v'$, complementation is not 
used in the above equivalence $u' \approx v'$. Therefore either $u'$ is a conjugate of $v'$ or 
of $(v')^R$. Moreover, the last occurrences of $(\kappa - 1)$ in $u'$ and in $v'$ both correspond 
to the end marker, which implies that $u' = v'$. If $u_2 = v_2$ then also $u = v$, and the 
claim holds. Suppose then that $u \ne v$. Now, $u_2 \ne v_2$, and either $u_2$ is a suffix of 
$v_2$ or $v_2$ is a suffix of $u_2$. By symmetry, we can assume that the first alternative 
holds: $v_2 = w u_2$, and then also $u_1 = v_1 w$. Since $\gamma_u = \gamma_v$, the substring $w$ must 
be a legal string (if only one occurrence of a pointer $p$ is in $w$, then $\kappa$ would 
overlap with $p$ either in $u$ or $v$, but not in both). Now 

$u = v_1 w \kappa' u_2 (\kappa - 1)\kappa$ and $v = v_1 \kappa' w u_2 (\kappa - 1)\kappa$. 

By the form of the substrings $w\kappa'$ in $u$ and $\kappa' w$ in $v$ in the above, $w$ must be 
an image $\varrho_\kappa(\alpha) = w$ for some $\alpha \in \Theta_{\kappa-2}^{\maltese}$. By Lemma 1, $\mathrm{dom}(w) = [p, q]$ for some 
pointers $p \le q$, and by Lemma 2(ii), either $w = w' q$ or $w = q w'$ for a substring 
$w'$. Moreover, as shown in the proof of Lemma 2, in the former case $w(q + 1)$ 
is a substring of both $u$ and $v$ and in the latter case, $(q + 1)w$ is a substring of 
both $u$ and $v$. It then follows that $q = \kappa - 1$, and so the image of $M_{\kappa-1}$ occurs 
twice in $u$ and $v$; this is impossible since $u$ and $v$ are realistic. □ 
Abstract. In this paper, we define the n-insertion $A \triangleright^{[n]} B$ of a language 
A into a language B and provide some properties of n-insertions. For 
instance, the n-insertion of a regular language into a regular language 
is regular but the n-insertion of a context-free language into a context- 
free language is not always context-free. However, it can be shown that 
the n-insertion of a regular (context-free) language into a context-free 
(regular) language is context-free. We also consider the decomposition of 
regular languages under n-insertion. 



1 Introduction 

Insertion and deletion operations are well-known in DNA computing - see, e.g., 
[5] and [6]. In this paper we investigate a generalized insertion operation, namely 
the shuffled insertion of (substrings of) a string in another string. 

Formally, we deal with the following operation. 

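Let $u, v \in X^*$ and let $n$ be a positive integer. Then the n-insertion of $u$ into 
$v$, i.e., $u \triangleright^{[n]} v$, is defined as $\{v_1 u_1 v_2 u_2 \cdots v_n u_n v_{n+1} \mid u = u_1 u_2 \cdots u_n,$ 
$u_1, u_2, \dots, u_n \in X^*,\ v = v_1 v_2 \cdots v_n v_{n+1},\ v_1, v_2, \dots, v_{n+1} \in X^*\}$. For 
languages $A, B \subseteq X^*$, the n-insertion $A \triangleright^{[n]} B$ of $A$ into $B$ is defined as 
$\bigcup_{u \in A, v \in B} u \triangleright^{[n]} v$. The shuffle product $A \diamond B$ of $A$ and $B$ is defined as 
$\bigcup_{n \ge 1} A \triangleright^{[n]} B$. 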
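
For single words the n-insertion can be computed by brute force. The following is a minimal sketch following the definition above; the function names `factorizations` and `n_insertion` are illustrative and not from the text.

```python
def factorizations(w, k):
    """Yield all ways to write w as a concatenation of k possibly empty factors."""
    if k == 1:
        yield (w,)
        return
    for i in range(len(w) + 1):
        for rest in factorizations(w[i:], k - 1):
            yield (w[:i],) + rest


def n_insertion(u, v, n):
    """All words v1 u1 v2 u2 ... vn un v(n+1) with u = u1...un and v = v1...v(n+1)."""
    result = set()
    for us in factorizations(u, n):
        for vs in factorizations(v, n + 1):
            word = "".join(vs[i] + us[i] for i in range(n)) + vs[n]
            result.add(word)
    return result


one = n_insertion("ab", "c", 1)
two = n_insertion("ab", "c", 2)
```

For instance, `one` contains only the two words in which "ab" stays contiguous, whereas `two` also contains "acb". Since factors may be empty, every word of the 1-insertion reappears in the 2-insertion.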

In Section 2, we provide some properties of n-insertions. For instance, the 
n-insertion of a regular language into a regular language is regular but the n- 
insertion of a context-free language into a context-free language is not always 
context-free. However, it can be shown that the n-insertion of a regular (context- 
free) language into a context-free (regular) language is context-free. In Section 
3, we prove that, for a given regular language $L \subseteq X^*$ and a positive integer 
$n$, it is decidable whether $L = A \triangleright^{[n]} B$ for some nontrivial regular languages 
$A, B \subseteq X^*$. Here a language $C \subseteq X^*$ is said to be nontrivial if $C \ne \{\epsilon\}$, where 
$\epsilon$ is the empty word. 

Regarding definitions and notations concerning formal languages and au- 
tomata, not defined in this paper, refer, for instance, to [2]. 

2 Shuffle Product and n-Insertion 

First, we consider the shuffle product of languages. 



N. Jonoska et al. (Eds.): Molecular Computing (Head Festschrift), LNCS 2950, pp. 213—218, 2004. 
(c) Springer- Verlag Berlin Heidelberg 2004 
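Lemma 21 Let $A, B \subseteq X^*$ be regular languages. Then $A \diamond B$ is a regular 
language. 

Proof. By $\overline{X}$ we denote the new alphabet $\{\bar a \mid a \in X\}$. Let $\mathcal{A} = (S, X, \delta, s_0, F)$ 
be a finite deterministic automaton with $\mathcal{L}(\mathcal{A}) = A$ and let $\mathcal{B} = (T, X, \theta, t_0, G)$ 
be a finite deterministic automaton with $\mathcal{L}(\mathcal{B}) = B$. Define the automaton 
$\overline{\mathcal{B}} = (T, \overline{X}, \bar\theta, t_0, G)$ as $\bar\theta(t, \bar a) = \theta(t, a)$ for any $t \in T$ and $a \in X$. Let $\rho$ be the 
homomorphism of $(X \cup \overline{X})^*$ onto $X^*$ defined as $\rho(a) = \rho(\bar a) = a$ for any $a \in X$. 
Moreover, let $\mathcal{L}(\overline{\mathcal{B}}) = \overline{B}$. Then $\rho(\overline{B}) = \{\rho(u) \mid u \in \overline{B}\} = B$ and 
$\rho(A \diamond \overline{B}) = A \diamond B$. Hence, to prove the lemma, it is enough to show that 
$A \diamond \overline{B}$ is a regular language over $X \cup \overline{X}$. Consider the automaton 
$\mathcal{A} \diamond \overline{\mathcal{B}} = (S \times T, X \cup \overline{X}, \delta \diamond \bar\theta, (s_0, t_0), F \times G)$ where 
$\delta \diamond \bar\theta((s, t), a) = (\delta(s, a), t)$ and $\delta \diamond \bar\theta((s, t), \bar a) = (s, \bar\theta(t, \bar a))$ for any 
$(s, t) \in S \times T$ and $a \in X$. Then it is easy to see that $w \in \mathcal{L}(\mathcal{A} \diamond \overline{\mathcal{B}})$ if and only 
if $w \in A \diamond \overline{B}$, i.e., $A \diamond \overline{B}$ is regular. This completes the proof of the lemma. 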
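
The product construction of this proof can be simulated directly. In the proof the automaton is deterministic over the marked alphabet; after erasing the marks with the homomorphism, each letter may be consumed by either component, so the sketch below tracks a set of state pairs instead. It assumes an illustrative encoding in which an automaton is a triple (transition dictionary, initial state, set of final states) and a missing transition means rejection; the sample automata are hypothetical.

```python
def shuffle_accepts(dfa_a, dfa_b, word):
    """Decide whether word lies in the shuffle product of the two languages."""
    (delta, s0, final_a), (theta, t0, final_b) = dfa_a, dfa_b
    current = {(s0, t0)}
    for a in word:
        following = set()
        for s, t in current:
            if (s, a) in delta:      # the letter is consumed by the first automaton
                following.add((delta[(s, a)], t))
            if (t, a) in theta:      # the letter is consumed by the second automaton
                following.add((s, theta[(t, a)]))
        current = following
    return any(s in final_a and t in final_b for s, t in current)


# Hypothetical automata accepting {"ab"} and {"c"} respectively.
A = ({(0, "a"): 1, (1, "b"): 2}, 0, {2})
B = ({(0, "c"): 1}, 0, {1})

accepted = [w for w in ("abc", "acb", "cab", "bac", "ab") if shuffle_accepts(A, B, w)]
```

Only the three interleavings of "ab" with "c" are accepted.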

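Proposition 21 Let $A, B \subseteq X^*$ be regular languages and let $n$ be a positive 
integer. Then $A \triangleright^{[n]} B$ is a regular language. 

Proof. Let the notations of $\overline{X}$, $\overline{B}$ and $\rho$ be the same as above. Notice that 
$A \triangleright^{[n]} \overline{B} = (A \diamond \overline{B}) \cap (\overline{X}^* X^*)^n \overline{X}^*$. Since $(\overline{X}^* X^*)^n \overline{X}^*$ is regular, 
$A \triangleright^{[n]} \overline{B}$ is regular. Consequently, $A \triangleright^{[n]} B = \rho(A \triangleright^{[n]} \overline{B})$ is regular. 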

Remark 21 The n-insertion of a context-free language into a context-free 
language is not always context-free. For instance, it is well known that 
$A = \{a^n b^n \mid n \ge 1\}$ and $B = \{c^n d^n \mid n \ge 1\}$ are context-free languages over 
$\{a, b\}$ and $\{c, d\}$, respectively. Since 
$(A \triangleright^{[2]} B) \cap a^+ c^+ b^+ d^+ = \{a^n c^m b^n d^m \mid n, m \ge 1\}$ is not context-free, 
$A \triangleright^{[2]} B$ is not context-free. Therefore, for any $n \ge 2$, the n-insertion of 
a context-free language into a context-free language is not always context-free. 
However, $A \triangleright^{[1]} B$ is a context-free language for any context-free languages $A$ 
and $B$ (see [4]). Usually, a 1-insertion is called an insertion. 

Now consider the n-insertion of a regular (context-free) language into a 
context-free (regular) language. 

Lemma 22 Let $A \subseteq X^*$ be a regular language and let $B \subseteq X^*$ be a context-free 
language. Then $A \diamond B$ is a context-free language. 

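Proof. The notations which we will use for the proof are assumed to be the 
same as above. Let $\mathcal{A} = (S, X, \delta, s_0, F)$ be a finite deterministic automaton with 
$\mathcal{L}(\mathcal{A}) = A$ and let $\mathcal{B} = (T, X, \Gamma, \theta, t_0, \gamma_0, \epsilon)$ be a pushdown automaton with 
$\mathcal{N}(\mathcal{B}) = B$. Let $\overline{\mathcal{B}} = (T, \overline{X}, \Gamma, \bar\theta, t_0, \gamma_0, \epsilon)$ be a pushdown automaton such that 
$\bar\theta(t, \bar a, \gamma) = \theta(t, a, \gamma)$ for any $t \in T$, $a \in X \cup \{\epsilon\}$ and $\gamma \in \Gamma$. Then 
$\rho(\mathcal{N}(\overline{\mathcal{B}})) = B$. Now define the pushdown automaton 
$\mathcal{A} \diamond \overline{\mathcal{B}} = (S \times T, X \cup \overline{X}, \Gamma \cup \{\#\}, \delta \diamond \bar\theta, (s_0, t_0), \gamma_0, \epsilon)$ as follows: 

1. $\forall a \in X$, $\delta \diamond \bar\theta((s_0, t_0), a, \gamma_0) = \{((\delta(s_0, a), t_0), \#\gamma_0)\}$, 

$\delta \diamond \bar\theta((s_0, t_0), \bar a, \gamma_0) = \{((s_0, t'), \#\gamma') \mid (t', \gamma') \in \bar\theta(t_0, \bar a, \gamma_0)\}$. 

2. $\forall a \in X$, $\forall (s, t) \in S \times T$, $\forall \gamma \in \Gamma \cup \{\#\}$, $\delta \diamond \bar\theta((s, t), a, \gamma) = \{((\delta(s, a), t), \gamma)\}$. 

3. $\forall a \in X$, $\forall (s, t) \in S \times T$, $\forall \gamma \in \Gamma$, $\delta \diamond \bar\theta((s, t), \bar a, \gamma) = \{((s, t'), \gamma') \mid (t', \gamma') \in \bar\theta(t, \bar a, \gamma)\}$. 

4. $\forall (s, t) \in F \times T$, $\delta \diamond \bar\theta((s, t), \epsilon, \#) = \{((s, t), \epsilon)\}$. 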

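Let $w = \bar v_1 u_1 \bar v_2 u_2 \cdots \bar v_n u_n \bar v_{n+1}$ where $u_1, u_2, \dots, u_n \in X^*$ and 
$\bar v_1, \bar v_2, \dots, \bar v_{n+1} \in \overline{X}^*$. Assume $\delta \diamond \bar\theta((s_0, t_0), w, \gamma_0) \ne \emptyset$. Then we have the 
following configuration: 
$((s_0, t_0), w, \gamma_0) \vdash^* ((\delta(s_0, u_1 u_2 \cdots u_n), t'), \epsilon, \# \cdots \# \gamma')$ where 
$(t', \gamma') \in \bar\theta(t_0, \bar v_1 \bar v_2 \cdots \bar v_{n+1}, \gamma_0)$. If $w \in \mathcal{N}(\mathcal{A} \diamond \overline{\mathcal{B}})$, then 
$((\delta(s_0, u_1 u_2 \cdots u_n), t'), \epsilon, \# \cdots \# \gamma') \vdash^* ((\delta(s_0, u_1 u_2 \cdots u_n), t'), \epsilon, \epsilon)$. Hence 
$(\delta(s_0, u_1 u_2 \cdots u_n), t') \in F \times T$ and $\gamma' = \epsilon$. This means that $u_1 u_2 \cdots u_n \in A$ and 
$\bar v_1 \bar v_2 \cdots \bar v_{n+1} \in \overline{B}$. Hence $w \in A \diamond \overline{B}$. Now let $w \in A \diamond \overline{B}$. Then, by the above 
configuration, we have 
$((s_0, t_0), w, \gamma_0) \vdash^* ((\delta(s_0, u_1 u_2 \cdots u_n), t'), \epsilon, \# \cdots \#) \vdash^* ((\delta(s_0, u_1 u_2 \cdots u_n), t'), \epsilon, \epsilon)$ 
and $w \in \mathcal{N}(\mathcal{A} \diamond \overline{\mathcal{B}})$. Thus $A \diamond \overline{B} = \mathcal{N}(\mathcal{A} \diamond \overline{\mathcal{B}})$ and $A \diamond \overline{B}$ is context-free. 
Since $\rho(A \diamond \overline{B}) = A \diamond B$, $A \diamond B$ is context-free. 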

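Proposition 22 Let $A \subseteq X^*$ be a regular (context-free) language and let 
$B \subseteq X^*$ be a context-free (regular) language. Then $A \triangleright^{[n]} B$ is a context-free 
language. 

Proof. We consider the case that $A \subseteq X^*$ is regular and $B \subseteq X^*$ is context-free. 
Since $A \triangleright^{[n]} \overline{B} = (A \diamond \overline{B}) \cap (\overline{X}^* X^*)^n \overline{X}^*$ and $(\overline{X}^* X^*)^n \overline{X}^*$ is regular, 
$A \triangleright^{[n]} \overline{B}$ is context-free. Consequently, $A \triangleright^{[n]} B = \rho(A \triangleright^{[n]} \overline{B})$ is context-free. 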



3 Decomposition 

Let $L \subseteq X^*$ be a regular language and let $\mathcal{A} = (S, X, \delta, s_0, F)$ be a finite 
deterministic automaton accepting the language $L$, i.e., $\mathcal{L}(\mathcal{A}) = L$. For 
$u, v \in X^*$, by $u \sim v$ we denote the equivalence relation of finite index on $X^*$ 
such that $\delta(s, u) = \delta(s, v)$ for any $s \in S$. Then it is well known that for any 
$x, y \in X^*$, $xuy \in L \iff xvy \in L$ if $u \sim v$. Let $[u] = \{v \in X^* \mid u \sim v\}$ for 
$u \in X^*$. It is easy to see that $[u]$ can be effectively constructed using $\mathcal{A}$ for any 
$u \in X^*$. Now let $n$ be a positive integer. We consider the decomposition 
$L = A \triangleright^{[n]} B$. Let $K_n = \{([u_1], [u_2], \dots, [u_n]) \mid u_1, u_2, \dots, u_n \in X^*\}$. Notice that 
$K_n$ is a finite set. 

Lemma 31 There is an algorithm to construct Kn. 

Proof. Obvious from the fact that [u] can be effectively constructed for any 
u G X* and {[u] | m G A*} = {[u] | u G X* , |m| < Here |m| and \S\ denote 

the length of u and the cardinality of S, respectively. 
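
The classes $[u]$ correspond to the state transformations induced by words, so they can be listed by a breadth-first search. The following is a minimal sketch, assuming a complete deterministic automaton given as a transition dictionary; the function name and the sample automaton (which tracks the parity of the letter a) are hypothetical and not from the text.

```python
from collections import deque
from itertools import product


def word_classes(states, alphabet, delta):
    """Map each transformation of the state set induced by some word
    to one shortest word inducing it (one representative per class [u])."""
    states = list(states)
    identity = tuple(states)             # transformation induced by the empty word
    representatives = {identity: ""}
    queue = deque([identity])
    while queue:
        f = queue.popleft()
        for a in alphabet:
            # Transformation induced by (word of f) followed by the letter a.
            g = tuple(delta[(s, a)] for s in f)
            if g not in representatives:
                representatives[g] = representatives[f] + a
                queue.append(g)
    return representatives


# Hypothetical automaton over {a, b}: state = parity of the number of a's read.
delta = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
classes = word_classes([0, 1], "ab", delta)

# K_n for n = 2: all pairs of classes, given by their representatives.
K2 = list(product(classes.values(), repeat=2))
```

In this example there are two classes, represented by the empty word and by "a", so $K_2$ has four elements.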



Let $u \in X^*$. By $\rho_n(u)$, we denote $\{([u_1], [u_2], \dots, [u_n]) \mid u = u_1 u_2 \cdots u_n,$ 
$u_1, u_2, \dots, u_n \in X^*\}$. Let $\mu = ([u_1], [u_2], \dots, [u_n]) \in K_n$ and let $B_\mu = \{v \in X^* \mid$ 
$\{v_1\}[u_1]\{v_2\}[u_2] \cdots \{v_n\}[u_n]\{v_{n+1}\} \subseteq L$ for any $v = v_1 v_2 \cdots v_n v_{n+1}$, 
$v_1, v_2, \dots, v_n, v_{n+1} \in X^*\}$. 

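Lemma 32 $B_\mu \subseteq X^*$ is a regular language and it can be effectively constructed. 

Proof. Let $S^{(i)} = \{s^{(i)} \mid s \in S\}$, $0 \le i \le n$, and let $\tilde S = \bigcup_{0 \le i \le n} S^{(i)}$. We define 
the following nondeterministic automaton $\mathcal{A}' = (\tilde S, X, \tilde\delta, \{s_0^{(0)}\}, S^{(n)} \setminus F^{(n)})$ with 
$\epsilon$-moves, where $F^{(n)} = \{s^{(n)} \mid s \in F\}$. The state transition relation $\tilde\delta$ is defined 
as follows: 

$\tilde\delta(s^{(i)}, a) = \{\delta(s, a)^{(i)}, \delta(s, a u_{i+1})^{(i+1)}\}$, for any $a \in X \cup \{\epsilon\}$ and 
$i = 0, 1, \dots, n - 1$, and $\tilde\delta(s^{(n)}, a) = \{\delta(s, a)^{(n)}\}$, for any $a \in X$. 

Let $v \in \mathcal{L}(\mathcal{A}')$. Then 
$\delta(s_0, v_1 u_1 v_2 u_2 \cdots v_n u_n v_{n+1})^{(n)} \in \tilde\delta(s_0^{(0)}, v_1 v_2 \cdots v_n v_{n+1}) \cap (S^{(n)} \setminus F^{(n)})$ 
for some $v = v_1 v_2 \cdots v_n v_{n+1}$, $v_1, v_2, \dots, v_n, v_{n+1} \in X^*$. Hence 
$v_1 u_1 v_2 u_2 \cdots v_n u_n v_{n+1} \notin L$, i.e., $v \in X^* \setminus B_\mu$. Now let $v \in X^* \setminus B_\mu$. 
Then there exists $v = v_1 v_2 \cdots v_n v_{n+1}$, $v_1, v_2, \dots, v_n, v_{n+1} \in X^*$ such that 
$v_1 u_1 v_2 u_2 \cdots v_n u_n v_{n+1} \notin L$. Therefore, 
$\tilde\delta(s_0^{(0)}, v_1 v_2 \cdots v_n v_{n+1}) \cap (S^{(n)} \setminus F^{(n)}) \ne \emptyset$, i.e., 
$v = v_1 v_2 \cdots v_n v_{n+1} \in \mathcal{L}(\mathcal{A}')$. Consequently, $B_\mu = X^* \setminus \mathcal{L}(\mathcal{A}')$ and $B_\mu$ is regular. 
Notice that $X^* \setminus \mathcal{L}(\mathcal{A}')$ can be effectively constructed. 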

Symmetrically, consider $\nu = ([v_1], [v_2], \dots, [v_n], [v_{n+1}]) \in K_{n+1}$ and $A_\nu = 
\{u \in X^* \mid \forall u = u_1 u_2 \cdots u_n,\ u_1, u_2, \dots, u_n \in X^*,\ [v_1]\{u_1\}[v_2]\{u_2\} \cdots [v_n]\{u_n\}$ 
$[v_{n+1}] \subseteq L\}$. 

Lemma 33 $A_\nu \subseteq X^*$ is a regular language and it can be effectively constructed. 

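Proof. Let $S^{(i)} = \{s^{(i)} \mid s \in S\}$, $1 \le i \le n + 1$, and let $\tilde S = \bigcup_{1 \le i \le n+1} S^{(i)}$. 

We define the nondeterministic automaton 
$\mathcal{B}' = (\tilde S, X, \tilde\delta, \{\delta(s_0, v_1)^{(1)}\}, S^{(n+1)} \setminus F^{(n+1)})$ with $\epsilon$-moves, where 
$F^{(n+1)} = \{s^{(n+1)} \mid s \in F\}$. The state transition relation $\tilde\delta$ is defined as follows: 

$\tilde\delta(s^{(i)}, a) = \{\delta(s, a)^{(i)}, \delta(s, a v_{i+1})^{(i+1)}\}$, for any $a \in X \cup \{\epsilon\}$ 
and $i = 1, 2, \dots, n$. 

By the same way as in the proof of Lemma 32, we can prove that 
$A_\nu = X^* \setminus \mathcal{L}(\mathcal{B}')$. Therefore, $A_\nu$ is regular. Notice that $X^* \setminus \mathcal{L}(\mathcal{B}')$ can be 
effectively constructed. 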

Proposition 31 Let $A, B \subseteq X^*$ and let $L \subseteq X^*$ be a regular language. If $L = A \triangleright^{[n]} B$ then there exist regular languages $A', B' \subseteq X^*$ such that $A \subseteq A'$, $B \subseteq B'$ and $L = A' \triangleright^{[n]} B'$.



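Proof. Put $B' = \bigcap_{\mu \in P_n(A)} B_\mu$. Let $v \in B$ and let $\mu \in P_n(A)$. Since $\mu \in P_n(A)$, there exists $u \in A$ such that $\mu = ([u_1], [u_2], \ldots, [u_n])$ and $u = u_1u_2\cdots u_n$, $u_1, u_2, \ldots, u_n \in X^*$. By $u \triangleright^{[n]} v \subseteq L$, we have $\{v_1\}[u_1]\{v_2\}[u_2]\cdots\{v_n\}[u_n]\{v_{n+1}\} \subseteq L$ for any $v = v_1v_2\cdots v_nv_{n+1}$, $v_1, v_2, \ldots, v_n, v_{n+1} \in X^*$. This means that $v \in B_\mu$. Thus $B \subseteq \bigcap_{\mu \in P_n(A)} B_\mu = B'$.

Now assume that $u \in A$ and $v \in B'$. Let $u = u_1u_2\cdots u_n$, $u_1, u_2, \ldots, u_n \in X^*$ and let $\mu = ([u_1], [u_2], \ldots, [u_n]) \in P_n(u) \subseteq P_n(A)$. By $v \in B' \subseteq B_\mu$, $v_1u_1v_2u_2\cdots v_nu_nv_{n+1} \in \{v_1\}[u_1]\{v_2\}[u_2]\cdots\{v_n\}[u_n]\{v_{n+1}\} \subseteq L$ for any $v = v_1v_2\cdots v_nv_{n+1}$, $v_1, v_2, \ldots, v_n, v_{n+1} \in X^*$. Hence $u \triangleright^{[n]} v \subseteq L$ and $A \triangleright^{[n]} B' \subseteq L$. On the other hand, since $B \subseteq B'$ and $A \triangleright^{[n]} B = L$, we have $A \triangleright^{[n]} B' = L$. Symmetrically, put $A' = \bigcap_{\nu \in P_{n+1}(B')} A_\nu$. In the same way as above, we can prove that $A \subseteq A'$ and $L = A' \triangleright^{[n]} B'$.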
Theorem 1. For any regular language $L \subseteq X^*$ and a positive integer $n$, it is decidable whether $L = A \triangleright^{[n]} B$ for some nontrivial regular languages $A, B \subseteq X^*$.

Proof. Let $\mathcal{A} = \{A_\nu \mid \nu \in K_{n+1}\}$ and $\mathcal{B} = \{B_\mu \mid \mu \in K_n\}$. By the preceding lemmata, $\mathcal{A}$, $\mathcal{B}$ are finite sets of regular languages which can be effectively constructed. Assume that $L = A \triangleright^{[n]} B$ for some nontrivial regular languages $A, B \subseteq X^*$. In this case, by Proposition 8, there exist regular languages $A' \supseteq A$ and $B' \supseteq B$ which are an intersection of languages in $\mathcal{A}$ and an intersection of languages in $\mathcal{B}$, respectively. It is obvious that $A'$, $B'$ are nontrivial languages. Thus we have the following algorithm:

1. Take any languages from A and let A' be their intersection. 

2. Take any languages from B and let B' be their intersection. 

3. Calculate $A' \triangleright^{[n]} B'$.

4. If $A' \triangleright^{[n]} B' = L$, then the output is “YES”.

5. If the output is “NO”, search another pair of {A',B'} until obtaining the 
output “YES”. 

6. This procedure terminates after a finite-step trial. 

7. Once we get the output “YES”, then $L = A \triangleright^{[n]} B$ for some nontrivial regular languages $A, B \subseteq X^*$.

8. Otherwise, there are no such decompositions. 

Let $n$ be a positive integer. By $\mathcal{F}(n, X)$, we denote the class of finite languages $\{L \subseteq X^* \mid \max\{|u| \mid u \in L\} \le n\}$. Then the following result by C. Campeanu
et al. ([!]) can be obtained as a corollary of Theorem 9. 

Corollary 31 For a given positive integer $n$ and a regular language $A \subseteq X^*$, the problem whether $A = B \circ C$ for a nontrivial language $B \in \mathcal{F}(n, X)$ and a nontrivial regular language $C \subseteq X^*$ is decidable.

Proof. Obvious from the following fact: If $u, v \in X^*$ and $|u| \le n$, then $u \circ v = u \triangleright^{[n]} v$.
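The fact used in this proof can be checked by brute force on small words. The sketch below is only an illustration under an assumption: it reads the operation $\triangleright^{[n]}$ off from how it is used in this section (the word $u$ is cut into $n$ possibly empty factors that are inserted, in order, between $n + 1$ possibly empty factors of $v$), and it takes $\circ$ to be the ordinary shuffle. The function names are hypothetical.

```python
from itertools import combinations_with_replacement


def factorizations(w, parts):
    """All ways to write w as a concatenation of `parts` possibly empty factors."""
    n = len(w)
    result = []
    for cuts in combinations_with_replacement(range(n + 1), parts - 1):
        bounds = (0,) + cuts + (n,)
        result.append([w[bounds[i]:bounds[i + 1]] for i in range(parts)])
    return result


def n_insertion(u, v, n):
    """Assumed reading of the n-insertion: all words v1 u1 v2 u2 ... vn un v(n+1)."""
    words = set()
    for us in factorizations(u, n):
        for vs in factorizations(v, n + 1):
            pieces = []
            for i in range(n):
                pieces.append(vs[i])
                pieces.append(us[i])
            pieces.append(vs[n])
            words.add("".join(pieces))
    return words


def shuffle(u, v):
    """All interleavings of u and v that keep the letter order of each word."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)}
            | {v[0] + w for w in shuffle(u, v[1:])})


if __name__ == "__main__":
    # |u| = 2 <= n = 2, so the two operations agree on this pair.
    print(n_insertion("ab", "cd", 2) == shuffle("ab", "cd"))
    # With n = 1 < |u| the whole word u is inserted in one piece.
    print(sorted(n_insertion("ab", "cd", 1)))
```

When $n < |u|$ the two sets differ, which is why the bound $|u| \le n$ is needed in the corollary.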

The proof of the above corollary was given in a different way in [3] using the following result: Let $A, L \subseteq X^*$ be regular languages. Then it is decidable whether there exists a regular language $B \subseteq X^*$ such that $L = A \circ B$.
Abstract. Notions of Wang tiles, finite state machines and recursive
functions are tied together. We show that there is a natural way to
simulate finite state machines with output (transducers) with Wang tiles
and we show that recursive (computable) functions can be obtained as
composition of transducers through employing Wang tiles. We also show
how a programmable transducer can be self-assembled using TX DNA
molecules simulating Wang tiles and a linear array of DNA PX-JX₂
nanodevices.



1 Introduction 

In recent years there have been several major developments both experimentally 
and theoretically, that use DNA for obtaining three dimensional nanostructures, 
computation and as a material for nano-devices. 

Nanostructures. The inherently informational character of DNA makes it an 
attractive molecule for use in applications that entail targeted assembly. Ge- 
netic engineers have used the specificity of sticky-ended cohesion to direct the 
construction of plasmids and other vectors. Naturally-occurring DNA is a linear 
molecule in the sense that its helix axis is a line, although that line is typically 
not straight. Linear DNA molecules are not well-suited to serve as components 
of complex nanomaterials, but it is easy to construct DNA molecules with sta- 
ble branch points [20]. Synthetic molecules have been designed and shown to 
assemble into branched species [12,23], and more complex species that entail the 
lateral fusion of DNA double helices [21], such as DNA double crossover (DX) 
molecules [7], triple crossover (TX) molecules [13] or paranemic crossover (PX) 
molecules. Double and triple cross-over molecules have been used as tiles and 
building blocks for large nanoscale arrays [24,25]. In addition, three dimensional 
structures such as a cube [4], a truncated octahedron [29] and arbitrary graphs 
[10,19] have been constructed from DNA duplex and junction molecules. 

Computation. Theoretically, it has been shown that two dimensional arrays 
can simulate the dynamics of a bounded one dimensional cellular automaton 
and so are capable of potentially performing computations as a Universal Turing 
machine [24]. Several successful experiments performing computation have been
reported, most notably the initial successful experiment by Adleman [1] and the 
recent one from the same group solving an instance of SAT with 20 variables 
[3]. Successful experiments that have confirmed computation such as the binary 
addition (simulation of XOR) using triple cross-over molecules (tiles) have been 
reported in [16]. In [6] a 9-bit instance of the “knight problem” has been solved 
using RNA and in [8] a small instance of the maximal clique problem has been 
solved using plasmids. Theoretically it has been shown that by self-assembly of 
three dimensional graph structure many hard computational problems can be 
solved in one (constant) biostep operation [10,11]. 

Nano Devices. Based on the B-Z transition of DNA, a nano-mechanical device 
was introduced in [17]. Soon after, “DNA fuel” strands were used to produce 
devices whose activity is controlled by DNA strands [26,28]. The PX-JX₂ device
introduced in [26] has two distinct structural states, differing by a half-rotation;
each state is obtained by addition of a pair of DNA strands that hybridizes with
the device such that the molecule is in either the JX₂ state or in the PX state.

We consider it a natural step at this point to use the PX-JX₂ device and
DNA self-assembly to develop a programmable finite state machine. A simulation 
of a finite state automaton that uses duplex DNA molecules and a restriction 
endonuclease to recognize a sequence of DNA was reported in [2]. Unfortunately
this model is such that the DNA representing the input string is “eaten up” 
during the process of computation and no output (besides “accept-reject”) is 
produced. 

In this paper we exploit the idea of using DNA tiles (TX molecules) that 
correspond to Wang tiles to simulate a finite state machine with output, i.e., a 
transducer. Theoretically we show that by composition of transducers we can 
obtain all recursive functions and hence all computable functions (see Sections
3 and 4). This result is not surprising considering the fact that iterations of
a generalized sequential machine can simulate type-0 grammars [14,15]. How- 
ever, the explicit connection between recursive functions and Wang tiles has not 
been described before. Also, although the dynamics of transducers have been 
considered [22], the composition of transducers as a means to obtain the class of
recursive functions represents a new way to tie together recursive functions, tiles
and transducers. Moreover, in this paper, the purpose of Wang tiles is not to 
tile the plane, as they usually appear in literature, but merely to facilitate our 
description of the use of DNA TX molecules to simulate transducers and their 
composition. In particular, we assume the existence of “boundary colors” at some 
of the tiles which do not allow tiling extension beyond the appearance of such a 
color. The input tiles contain input “colors” on one side and a boundary color 
on the opposite side, such that the computation is performed in the direction of 
the input color. This is facilitated with composition colors on the sides (Section 4).

We go beyond two dimensions in this paper. We show how PX-JX₂ DNA
devices (whose rotation uses three dimensional space) can be incorporated into 
the boundary such that the input of the transducer can be programmed and 
potentially the whole finite-state nano-machine can be reusable. The paper is
organized as follows. Section 2 describes the relationship between Wang tiles 
and transducers. It describes the basic prototiles that can be used to simulate 
a transducer. The composition of transducers obtained through Wang tiles is 
described in Section 3. Here we also show how primitive recursive functions can 
be obtained through composition of transducers. The general recursion (com- 
putable functions) simulated with Wang tiles is presented in Section 4. DNA 
implementation of transducers with TX molecules is described in Section 5 and 
the use of the PX-JX₂ devices for setting up the input sequence is described in
Section 5.2. We end with a few concluding remarks.



2 Finite State Machines with Output: Basic Model 

In this section we show the general idea of performing a computation by a 
finite state machine using tiles. The actual assembly by DNA will be described 
in Section 5. We briefly recall the definition of a transducer. This notion is well 
known in automata theory and an introduction to transducers (Mealy machines) 
can be found in [9]. 

A finite state machine with output or a transducer is $T = (\Sigma, \Sigma', Q, \delta, s_0, F)$ where $\Sigma$ and $\Sigma'$ are finite alphabets, $Q$ is a finite set of states, $\delta$ is the transition function, $s_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final or terminal states. The alphabet $\Sigma$ is the input alphabet and the alphabet $\Sigma'$ is the output alphabet. We denote with $\Sigma^*$ the set of all words over the alphabet $\Sigma$. This includes the word with “no symbols”, the empty word denoted with $\lambda$. For a word $w = a_1 \cdots a_k$ where $a_i \in \Sigma$, the length of $w$ denoted with $|w|$ is $k$. For the empty word $\lambda$, we have $|\lambda| = 0$.

The transition operation $\delta$ is a subset of $Q \times \Sigma \times \Sigma' \times Q$. The elements of $\delta$ are denoted with $(q, a, a', q')$ or $(q, a) \mapsto (a', q')$ meaning that when $T$ is in state $q$ and scans input symbol $a$, then $T$ changes into state $q'$ and gives output symbol $a'$. In the case of deterministic transducers, $\delta$ is a function $\delta : Q \times \Sigma \to \Sigma' \times Q$, i.e., at a given state reading a given input symbol, there is a unique output state and an output symbol. Usually the states of the transducer are presented as vertices of a graph and the transitions defined with $\delta$ are presented as directed edges with input/output symbols as labels. If there is no edge from a vertex $q$ in the graph that has input label $a$, we assume that there is an additional “junk” state where all such transitions end. This state is usually omitted from the graph since it is not essential for the computation. The transducer is said to recognize a string (or a word) $w$ over alphabet $\Sigma$ if there is a path in the graph from the initial state $s_0$ to a terminal state in $F$ with input label $w$. The set of words recognized by a transducer $T$ is denoted with $L(T)$ and is called the language recognized by $T$. It is well known that finite state transducers recognize the class of regular languages.

We concentrate on deterministic transducers. In this case the transition function $\delta$ maps the input word $w \in \Sigma^*$ to a word $w' \in (\Sigma')^*$. So the transducer $T$ can be considered to be a function from $L(T)$ to $(\Sigma')^*$, i.e., $T : L(T) \to (\Sigma')^*$.
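A deterministic transducer in this sense is easy to prototype. The following is a minimal sketch, not the construction used in the paper: the class name `Transducer`, its method `run`, and the bit-flipping example are hypothetical, and a missing transition is treated as a move to the omitted junk state.

```python
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

State = Hashable
Symbol = Hashable


class Transducer:
    """Deterministic finite state machine with output (illustrative only)."""

    def __init__(self,
                 delta: Dict[Tuple[State, Symbol], Tuple[Symbol, State]],
                 start: State,
                 finals: Set[State]) -> None:
        # delta maps (state, input symbol) to (output symbol, next state)
        self.delta = dict(delta)
        self.start = start
        self.finals = set(finals)

    def run(self, word: Iterable[Symbol]) -> Optional[List[Symbol]]:
        """Return the output word, or None if the input is not recognized."""
        state = self.start
        output: List[Symbol] = []
        for a in word:
            if (state, a) not in self.delta:
                return None  # undefined transition: the "junk" state
            b, state = self.delta[(state, a)]
            output.append(b)
        return output if state in self.finals else None


# Hypothetical one-state example: complement every bit of the input.
flip = Transducer({("q", "0"): ("1", "q"), ("q", "1"): ("0", "q")}, "q", {"q"})

if __name__ == "__main__":
    print(flip.run("0110"))
```

Returning `None` for unrecognized input mirrors the view of $T$ as a partial function defined only on $L(T)$.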
Examples:

1. The transducer $T_1$ presented in Figure 1 (a) has initial and terminal state $s_0$. The input alphabet is {0, 1} and the output alphabet is $\Sigma' = \emptyset$. It recognizes the set of binary strings that represent numbers divisible by 3. The states $s_0, s_1, s_2$ represent the remainder of the division of the input string by 3.




(a) (b) 

Fig. 1. A finite state machine that accepts binary strings that are divisible by 3. 
The machine in figure (b) outputs the result of dividing the binary string with 
3 in binary 



2. The transducer $T_2$ presented in Figure 1 (b) is essentially the same as $T_1$
except that now the output alphabet is also {0, 1}. The output in this case 
is the result of the division of the binary string with three. On input 10101 
(21 in decimal) the transducer gives the output 00111 (7 in decimal). 

3. Our next example refers to encoders. Due to manufacturing constraints of 
magnetic storage devices, the binary data cannot be stored verbatim on the 
disk drive. One method of storing binary data on a disk drive is by using the 
modified frequency modulation (MFM) scheme currently used on many disk 
drives. The MFM scheme inserts a 0 between each pair of data bits, unless 
both data bits are 0, in which case it inserts a 1. The finite state machine, 
transducer, that provides this simple scheme is represented in Figure 2 (a). 
In this case the output alphabet is $\Sigma' = \{00, 01, 10\}$. If we consider rewriting of the symbols with $00 \mapsto \alpha$, $01 \mapsto \beta$ and $10 \mapsto \gamma$ we have the transducer in Figure 2 (b).

4. A transducer that performs binary addition is presented in Figure 2 (c). The input alphabet is $\Sigma = \{00, 01, 10, 11\}$ representing a pair of digits to be added, i.e., if $x = x_1 \cdots x_k$ and $y = y_1 \cdots y_k$ are two numbers written in binary ($x_i, y_i \in \{0, 1\}$), the input for the transducer is written in the form $[x_ky_k][x_{k-1}y_{k-1}] \cdots [x_1y_1]$. The output of the transducer is the sum of those numbers. The state $s_1$ is the “carry”, $s_0$ is the initial state and all states are terminal. In [16] essentially the same transducer was simulated by gradually connecting TX molecules.
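Examples 2 and 4 can be reproduced with a few lines of code. This is a sketch under assumptions: the transition tables are the standard remainder and carry constructions, not read off the figures (which are not reproduced here), and the adder pads both operands with a leading 0 so that a final carry is not lost. All names are hypothetical.

```python
def run(delta, start, finals, word):
    """Run a deterministic transducer; None means the input is not recognized."""
    state, out = start, []
    for a in word:
        if (state, a) not in delta:
            return None
        b, state = delta[(state, a)]
        out.append(b)
    return out if state in finals else None


# Division by 3: the state is the remainder r seen so far. Reading bit b gives
# the value 2r + b; the quotient digit is emitted and the new remainder kept.
DIV3 = {(r, b): ((2 * r + b) // 3, (2 * r + b) % 3)
        for r in range(3) for b in (0, 1)}

# Addition: the state is the carry c; the input symbol is a pair of digits.
ADD = {(c, (x, y)): ((x + y + c) % 2, (x + y + c) // 2)
       for c in (0, 1) for x in (0, 1) for y in (0, 1)}


def divide_by_three(bits):
    out = run(DIV3, 0, {0}, [int(b) for b in bits])
    return None if out is None else "".join(str(d) for d in out)


def add_binary(x, y):
    width = max(len(x), len(y)) + 1          # room for a final carry
    x, y = x.zfill(width), y.zfill(width)
    # least significant pair first, as in the input convention of Example 4
    pairs = [(int(a), int(b)) for a, b in zip(reversed(x), reversed(y))]
    out = run(ADD, 0, {0, 1}, pairs)
    return "".join(str(d) for d in reversed(out))


if __name__ == "__main__":
    print(divide_by_three("10101"))   # 21 divided by 3
    print(add_binary("011", "001"))   # 3 + 1
```

On input 10101 the division transducer reproduces the output 00111 quoted in Example 2, and a string that is not divisible by 3 is rejected because the run does not end in the remainder-0 state.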

For a given $T = (\Sigma, \Sigma', Q, \delta, s_0, F)$ the transition $(q, a) \mapsto (a', q')$ can be represented schematically with a square as shown in Figure 3. Such a square can be considered as a Wang tile with colored edges, such that on the left and right we have the state colors encoding the input and output states of the transition and on the bottom and top we have colors encoding input and output symbols. Then a computation with $T$ is obtained by a process of assembling the tiles such that the abutted edges are colored with the same color. Only translations, and no rotations, of the tiles are allowed. We describe this process in more detail below.

Fig. 2. An encoder that produces the modified frequency modulation code.



2.1 Finite State Machines with Tile Assembly 

Tiles: A Wang tile is a unit square with colored edges. A finite set of distinct unit squares with colored edges are called Wang prototiles. We assume that from each prototile there is an arbitrarily large number of copies that we call tiles. A tile $\tau$ with left edge colored $l$, bottom edge colored $b$, top edge colored $t$ and right edge colored $r$ is denoted with $\tau = [l, b, t, r]$. No rotation of the tiles is allowed. Two tiles $\tau = [l, b, t, r]$ and $\tau' = [l', b', t', r']$ can be placed next to each other, $\tau$ to the left of $\tau'$ iff $r = l'$, and $\tau'$ on top of $\tau$ iff $t = b'$.
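The two adjacency rules translate directly into code. The sketch below is illustrative only; the `Tile` type and the two predicate names are hypothetical, and the edge order follows the notation $[l, b, t, r]$ used above.

```python
from typing import NamedTuple


class Tile(NamedTuple):
    """A Wang tile [l, b, t, r]; tiles are never rotated."""
    left: str
    bottom: str
    top: str
    right: str


def fits_left_of(tau: Tile, tau2: Tile) -> bool:
    """tau may sit immediately to the left of tau2 iff r = l'."""
    return tau.right == tau2.left


def fits_below(tau: Tile, tau2: Tile) -> bool:
    """tau2 may sit on top of tau iff t = b'."""
    return tau.top == tau2.bottom


if __name__ == "__main__":
    a = Tile("red", "blue", "green", "white")
    b = Tile("white", "blue", "red", "red")
    c = Tile("red", "green", "blue", "white")
    print(fits_left_of(a, b), fits_below(a, c), fits_left_of(b, a))
```

Because a tile is an ordered 4-tuple, swapping edges produces a different prototile, which is how the no-rotation rule is reflected here.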

– Computational tiles. For a transducer $T$ with a transition of the form $(q, a) \mapsto (a', q')$ we associate a prototile $[q, a, a', q']$ as presented in Figure 3. If there are $m$ transitions in the transducer, we associate $m$ such prototiles. These tiles will be called computational tiles.




tile: $[q, a, a', q']$, transition: $(q, a) \mapsto (a', q')$



Fig. 3. A computational tile for a transducer. 



– Input and output tiles. Additional colors called border are added to the set of colors. These colors will not be part of the assembly, but will represent the boundary of the computational assembly. Hence the left border is distinct from the right border. We denote these with $\beta_l$, $\beta_b$, $\beta_r$, $\beta_t$ for left, bottom, right and top border. We assume that each input word is from $\Sigma^*\alpha$ where $\alpha$ is a new symbol “end of input” that does not belong to $\Sigma$. For $a \in \Sigma$ an input prototile $\tau_a = [c, \beta_b, a, c]$ is added. There are $|\Sigma|$ input prototiles, each representing an input symbol. The left and right sides are colored with the same color, the connect color $c$. For the “end of input” symbol $\alpha$ a prototile $\tau_\alpha = [c, \beta_b, \alpha, \beta_r]$ is constructed. The output tiles are essentially the same as the input tiles, except they have the “top” border: $\tau'_a = [c', a, \beta_t, c']$ where $c'$ is another connect color. The output row of tiles starts with $\tau_{left} = [\beta_l, \beta'_t, \beta_t, c']$ and ends with $\tau_{right} = [c', \beta'_t, \beta_t, \beta_r]$. For DNA implementation, $\beta_t$ may be represented with a set of different motifs that will facilitate the “read out” of the result. With these sets of input and output tiles, every computation with $T$ is obtained as a tiled rectangle surrounded by boundary colors.

– Start tiles and accepting (end) tiles. Start of the input for the transducer $T$ is indicated with the start prototile $S_T = [\beta_l, \beta_b, T, c]$ where $T$ is a special color specifying the transducer $T$. The input tiles can then be placed next to such a starting tile. For the starting state $s_0$ a starting prototile $\tau_0 = [\beta_l, T, \eta, s_0]$ is constructed. Then $\tau_0$ can be placed on top of $S_T$. The color $\eta$ can be equal to $T$ if we want to iterate the same transducer or it can indicate another transducer that should be applied after $T$. If the computation is to be ended, $\eta$ is equal to $\beta'_t$ indicating the start of the “top boundary”. For each terminal state $f \in F$ we associate a terminal prototile $\tau_f = [f, \alpha, \alpha, \beta_r]$ if another computation is required, otherwise, $\tau_f = [f, \alpha, \beta'_t, \beta_r]$ which stops the computation.

The set of tiles for executing a computation for transducer $T_2$ that performs division by 3 (see Figure 1 (b)) is depicted in Figure 4 (a).

Computation: The computation is performed by first assembling the input starting with tile $S_T$ and a sequence of input tiles ending with $\tau_\alpha$. The computation of the transducer starts by assembling the computation tiles according to the input state (to the left) and the input symbol (at the bottom). The computation ends by assembling the end tile $\tau_f$ which can lie next to both the last input tile and the last computational tile iff it ends with a terminal state. The output result will be read from the sequence of the output colors assembled with the second row of tiles and application of the output tiles. In this way one computation with $T$ is obtained with a tiled $3 \times n$ rectangle ($n > 2$) such that the sides of the rectangle are colored with boundary colors. Denote all such $3 \times n$ rectangles with $D(T)$. By construction, the four boundary colors are different and since no rotation of Wang tiles is allowed, each boundary color can appear at only one side of such a rectangle. For a rectangle $\rho \in D(T)$ we denote with $w_\rho$ the sequence of colors opposite of boundary $\beta_b$ and with $w'_\rho$ the sequence of colors opposite of boundary $\beta_t$. Then we have the following

Proposition 21 For a transducer $T$ with input alphabet $\Sigma$, output alphabet $\Sigma'$, and any $\rho \in D(T)$ the following hold:

(i) $w_\rho \in L(T)$ and $w'_\rho \in (\Sigma')^*$.

(ii) $T(w_\rho) = w'_\rho$.




Transducers with Programmable Input by DNA Self-assembly 







Fig. 4. All prototiles needed for performing the computation of the transducer presented in Figure 1 and a simple computation over the string 10101 with result 00111. [Panel (a): start input tile, symbol (input) tiles, end input tile, start computation tile and end computation tile, labeled with their start, border, end-of-input, next-computation and stop-computation colors.]



Moreover, if $w \in L(T)$ and $T(w) = w'$, then there is a $\rho \in D(T)$ such that $w = w_\rho$ and $w' = w'_\rho$.



The tile computation of $T_2$ from Figure 1 (b) for the input string 10101 is
shown in Figure 4 (b). The output tiles are not included. 
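The two lower rows of such a rectangle can be assembled programmatically. The sketch below is an illustration under assumptions: the color names are placeholder strings, the division-by-3 transition table is the standard remainder construction rather than a transcription of Figure 1 (b), and the output row is left out, as in Figure 4 (b).

```python
from typing import NamedTuple


class Tile(NamedTuple):
    left: object
    bottom: object
    top: object
    right: object


# Placeholder names for the border, connect, end-of-input and start colors.
BL, BB, BR, TOP_START = "beta_l", "beta_b", "beta_r", "beta_t'"
END, C, T = "alpha", "c", "T"


def input_row(word):
    """Start tile S_T, one input tile per symbol, then the end-of-input tile."""
    row = [Tile(BL, BB, T, C)]
    row += [Tile(C, BB, a, C) for a in word]
    row.append(Tile(C, BB, END, BR))
    return row


def computation_row(delta, start, finals, bottom_row):
    """Place tau_0, one computational tile per input tile, then a stopping tau_f."""
    row = [Tile(BL, T, TOP_START, start)]
    state = start
    for below in bottom_row[1:-1]:
        key = (state, below.top)
        if key not in delta:
            return None                       # no prototile fits here
        out, nxt = delta[key]
        row.append(Tile(state, below.top, out, nxt))
        state = nxt
    if state not in finals:
        return None                           # no terminal tile exists
    row.append(Tile(state, END, TOP_START, BR))
    return row


def is_valid(rows):
    """Check that all abutting edges in a stack of rows carry equal colors."""
    for row in rows:
        if any(a.right != b.left for a, b in zip(row, row[1:])):
            return False
    for lower, upper in zip(rows, rows[1:]):
        if any(lo.top != up.bottom for lo, up in zip(lower, upper)):
            return False
    return True


# Assumed transition table for division by 3 (state = remainder).
DIV3 = {(r, str(b)): (str((2 * r + b) // 3), (2 * r + b) % 3)
        for r in range(3) for b in (0, 1)}

if __name__ == "__main__":
    bottom = input_row("10101")
    comp = computation_row(DIV3, 0, {0}, bottom)
    print(is_valid([bottom, comp]))
    print("".join(t.top for t in comp[1:-1]))
```

The top colors of the computational tiles spell the output word, which is the content of Proposition 21 (ii) for this instance.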
3 Primitive Recursive Functions from Transducers

In this section we show how the basic idea of modeling finite transducers by Wang 
tiles, with a small modification to allow composition of such computations, can 
perform computations of any primitive recursive function over the set of natural 
numbers. There is extensive literature on (primitive) recursive functions, and 
some introduction to the topic can be found in [5]. 

The primitive recursive functions are defined inductively as follows: 

Definition 31

(i) Three initial functions: “zero” function $z(x) = 0$, “add one” function $s(x) = x + 1$ and the “i-th projection” function $\pi_i^n(x_1, \ldots, x_n) = x_i$ are primitive recursive.

(ii) Let $\mathbf{x}$ denote $(x_1, \ldots, x_n)$. If $f(x_1, \ldots, x_k)$ and $g_1(\mathbf{x}), \ldots, g_k(\mathbf{x})$ are primitive recursive, then the composition $h(\mathbf{x}) = f(g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))$ is primitive recursive too.

(iii) Let $f(x_1, \ldots, x_n)$ and $g(y, x_1, \ldots, x_n, z)$ be two primitive recursive functions, then the function $h$ defined recursively (recursion):

$h(x_1, \ldots, x_n, 0) = f(x_1, \ldots, x_n)$

$h(x_1, \ldots, x_n, t + 1) = g(h(x_1, \ldots, x_n, t), x_1, \ldots, x_n, t)$

is also primitive recursive.
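The three clauses of the definition can be mirrored by small higher-order functions. This is only a sketch for intuition, with hypothetical names (`zero`, `succ`, `proj`, `compose`, `primrec`); it evaluates the functions directly on Python integers and is unrelated to the tile construction that follows.

```python
def zero(*xs):
    """The "zero" function; extra arguments are ignored for convenience."""
    return 0


def succ(x):
    """The "add one" function."""
    return x + 1


def proj(i):
    """The i-th projection (1-based), for any number of arguments."""
    return lambda *xs: xs[i - 1]


def compose(f, *gs):
    """h(x) = f(g_1(x), ..., g_k(x))."""
    return lambda *xs: f(*(g(*xs) for g in gs))


def primrec(f, g):
    """h(x, 0) = f(x) and h(x, t + 1) = g(h(x, t), x, t)."""
    def h(*args):
        *xs, t = args
        acc = f(*xs)
        for k in range(t):
            acc = g(acc, *xs, k)
        return acc
    return h


# add(x, t): f(x) = x and g(y, x, t) = y + 1
add = primrec(proj(1), compose(succ, proj(1)))
# mul(x, t): f(x) = 0 and g(y, x, t) = y + x
mul = primrec(zero, compose(add, proj(1), proj(2)))

if __name__ == "__main__":
    print(add(3, 4), mul(3, 4))
```

The loop in `primrec` unrolls the recursion from $t = 0$ upward, which is also the order in which the tile construction of Section 3.3 applies $f$ once and then $g$ repeatedly.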

Primitive recursive functions are total functions (defined for every input of natural numbers) and their range belongs to the class of recursive sets of numbers (see [5] and Section 4). Some of the primitive recursive functions include: $x + y$, $x \cdot y$, $x!$, $x^y$, ... In fact, in practice, most of the total functions (defined for all initial values) are primitive recursive. All computable functions are recursive (see Section 4), and there are examples of such functions that are not primitive recursive (see e.g. [5]).

In order to show that the previous tiling model can simulate every primitive 
recursive function we need to show that the initial functions can be modeled in 
this manner, as well as their composition and recursion. 

3.1 Initial Functions 

The input and the output alphabets for the transducers are equal: $\Sigma = \Sigma' = \{0, 1\}$. An input representing the number $n$ is given with a non-empty list of 0’s followed with $n + 1$ ones and a non-empty list of 0’s, i.e., $0 \cdots 0\,1^{n+1}\,0 \cdots 0$. That means that the string 01110 represents the number 2. The input as an n-tuple $\mathbf{x} = (x_1, \ldots, x_n)$ is represented with

$0 \cdots 0\,1^{x_1+1}\,0^{s_1}\,1^{x_2+1}\,0^{s_2} \cdots 1^{x_n+1}\,0^{s_n}$

where $s_i > 0$. This is known as the unary representation of $\mathbf{x}$. In Figure 5, transducers corresponding to the initial primitive recursive functions $z(x)$, $s(x)$, $U_i^n(\mathbf{x})$
are depicted. Note that the only input accepted by these functions is the correct representation of $x$ or $\mathbf{x}$. The zero transducer has three states, $q_0$ is initial and $q_2$ is terminal. The transducer for the function $s(x)$ contains a transition that adds a symbol 0 to the input if the last 0 is changed into 1. This transition utilizes the “end of input” symbol $\alpha$. The state $q_0$ is the initial and $q_3$ and ! are terminal states for this transducer. The increase of space needed as a result of computation is provided with the end tile (denoted !) for $q_3$. This tile is of the form $! = !_\alpha = [q_3, \beta_b, \alpha, \beta_r]$ which corresponds to the transition $(q_3, \lambda) \mapsto (\alpha, !)$. (In DNA implementation, this corresponds to a tile whose bottom side has no sticky ends.) The i-th projection function $U_i^n$ accepts as inputs strings $0 \cdots 0\,1^{x_1+1}\,0 \cdots 0\,1^{x_2+1} \cdots 1^{x_n+1}\,0 \cdots 0$, i.e., unary representations of $\mathbf{x}$, and it has $2n + 1$ states. Transitions starting from every state change 1’s into 0’s except at the state where the i-th entry of 1’s is copied.
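The unary representation can be made concrete with an encoder and a decoder. The sketch below assumes nothing beyond the format described above; the helper names `encode` and `decode` and the `pad` parameter (the number of 0’s in each separating block) are hypothetical.

```python
import re


def encode(xs, pad=1):
    """Unary representation of a tuple: each x is written as x + 1 ones,
    with `pad` zeros before the first block and after every block."""
    zeros = "0" * pad
    return zeros + "".join("1" * (x + 1) + zeros for x in xs)


def decode(s):
    """Inverse of encode; returns None if s is not a correct representation."""
    if not re.fullmatch(r"0+(?:1+0+)+", s):
        return None
    return tuple(len(block) - 1 for block in re.findall(r"1+", s))


if __name__ == "__main__":
    print(encode((2,)))            # the string quoted in the text for the number 2
    print(decode("0011100110"))    # a pair with extra padding zeros
    print(decode("1110"))          # rejected: no leading 0
```

The `decode` check plays the role of the input validation that the transducers of Figure 5 perform implicitly by having no transition on malformed input.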




Fig. 5. The three initial functions of the primitive recursive functions 



3.2 Composition 

Let $h$ be the composition $h(\mathbf{x}) = f(g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))$ where $\mathbf{x} = (x_1, \ldots, x_n)$. Since the input of $h$ is $\mathbf{x}$ and there are $k$ functions that need to be simultaneously computed on that input, we use a new alphabet $\bar{\Sigma} = \{(a_1, \ldots, a_k) \mid a_i \in \Sigma\}$ as an aid. By the inductive hypothesis, there are sets of prototiles $P(g_i)$ that perform computations for $g_i$ with input $\mathbf{x}$ for each $i = 1, \ldots, k$. Also there is a set of prototiles $P(f)$ that simulates the computation for $f$. The composition function $h$ is obtained with the following steps. Each step is explained below.

1. Check for correct input. 

2. Translate the input symbol $a$ into k-tuples $(a, a, \ldots, a)$ for $a \in \Sigma$.

3. For $i = 1, \ldots, k$ apply $P(g_i)$. The start prototiles are adjusted as follows: If the current transducer $T_i$ is to be followed by transducer $T_j$ then the start tile for $T_i$ is $[\beta_l, T_i, T_j, s_0]$ where $s_0$ is the initial state for $T_i$.

4. For all $i = 1, \ldots, k - 1$ in the i-th coordinate of the input mark the end of the last 1.

5. For all $i = 2, \ldots, k$ shift all 1’s in the i-th coordinate to the right of all 1’s in the $i - 1$ coordinate. This ensures a proper input for the tiles that compute $f$. After this, the k-tuples contain at most one 1.

6. Translate back k-tuples into 1 if they contain a symbol 1, otherwise into 0.

7. Apply the tiles that perform the computation for function /. 

Step 1 and 2 The input tiles are the same as described in Section 2. The translation of the input from a symbol to a k-tuple is obtained with the transducer $T_k$ depicted in Figure 6. Note that this transducer is also checking the correctness of the input.



Fig. 6. The transducer $T_k$ that translates symbols into k-tuples



Step 3 Each computational prototile for $g_i$ of the form $[q, a, a', q']$ is substituted with a set of prototiles $[q, (a_1, \ldots, a_{i-1}, a, \ldots, a_k), (a_1, \ldots, a_{i-1}, a', \ldots, a_k), q']$ where $a_j$ is any symbol in the alphabet $\Sigma_{g_j}$ used by the transducers for $g_j$. The end tile $!_\alpha = [q, \beta_b, \alpha, \beta_r]$ that increases the computational space is substituted with $!_{(\alpha, \ldots, \alpha)} = [q, \beta_b, (\alpha, \ldots, \alpha), \beta_r]$. We call this set of prototiles $P(g_i)$. The idea is to use the tiles $P(g_i)$ to compute the function $g_i$ on coordinate $i$ and leave the rest of the coordinates unchanged.
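The substitution in Step 3 is a mechanical lifting of each prototile to k-tuples. The following sketch illustrates it under simplifying assumptions: tiles are plain 4-tuples, all coordinates share one alphabet, and the function name `lift_tiles` is hypothetical.

```python
from itertools import product


def lift_tiles(tiles, i, k, alphabet):
    """Replace every prototile [q, a, a', q'] by the prototiles that act on
    coordinate i (1-based) of a k-tuple and copy all other coordinates."""
    lifted = set()
    for (q, a, a_out, q_next) in tiles:
        for others in product(alphabet, repeat=k - 1):
            before, after = others[:i - 1], others[i - 1:]
            bottom = before + (a,) + after
            top = before + (a_out,) + after
            lifted.add((q, bottom, top, q_next))
    return lifted


if __name__ == "__main__":
    tiles = {("q", "0", "1", "q")}
    for tile in sorted(lift_tiles(tiles, 2, 2, "01")):
        print(tile)
```

Each original prototile yields one lifted prototile per choice of the other $k - 1$ coordinates, so the prototile set grows by a factor of $|\Sigma|^{k-1}$ in this simplified setting.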

Step 4 After the application of the tiles $P(g_1)$ to $P(g_k)$, the k-tuples have the property that for each $i$ the sequence of symbols that appears in the i-th coordinate represents a number $m_i = g_i(\mathbf{x})$ of the form $w_i = 0 \cdots 0\,1^{m_i+1}\,0 \cdots 0\,\alpha^{s_i}$ for some $s_i \ge 1$. The end of input is obtained with the symbol $(\alpha, \ldots, \alpha)$. In order to prepare an input for the function $f$, all 1’s in the i-th coordinate have to appear to the right of the first 0 after the 1’s in the $i - 1$ coordinate for all $i = 2, \ldots, k$. Hence, first we mark the end of 1’s in each coordinate with an additional symbol $\gamma$. The transducer $M_i$ substitutes $\gamma$ instead of the first 0 after the sequence of 1’s in the i-th coordinate. It has the following transitions (all states have superscripts $i$ which are omitted to simplify the reading; we denote $y = (a_1, \ldots, a_k)$ and $y' = (a'_1, \ldots, a'_k)$):



$(s_0, y) \mapsto (y, s_0)$ for $a_i = 0$
$(s_0, y) \mapsto (y, s_1)$ for $a_i = 1$
when 1 is found in the i-th coordinate, change to $s_1$

$(s_1, y) \mapsto (y, s_1)$ for $a_i = 1$
$(s_1, y) \mapsto (y', s_2)$ for $a_i = 0$, $a'_i = \gamma$
when all 1’s are scanned, change the first 0 into $\gamma$

$(s_2, y) \mapsto (y, s_2)$



Note that the starting tile for each $M_i$ ($i \ne k$) is $[\beta_l, M_i, M_{i+1}, s_0^i]$. For $i = k$, $M_{i+1}$ is substituted with $\sigma_2$ which is the function that starts shifting with the second coordinate. Similarly, for $i = k$ the end of input $(\alpha, \ldots, \alpha)$ in the end tile is substituted with $(\sigma_2, \sigma_2, \ldots, \sigma_2)$.
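The marking transducer $M_i$ can be simulated directly on a list of k-tuples. This sketch follows the three-state transition list of Step 4 and nothing else; the function name `mark_end`, the state labels as strings, and the choice to return `None` for an undefined transition are assumptions.

```python
GAMMA = "γ"


def mark_end(word, i):
    """Replace, in coordinate i (1-based), the first 0 after the block of 1's
    by GAMMA; all other coordinates and positions are copied unchanged."""
    state, out = "s0", []
    for y in word:
        a = y[i - 1]
        if state == "s0":
            if a == "1":
                state = "s1"          # first 1 found in coordinate i
            elif a != "0":
                return None           # transition not defined
            out.append(y)
        elif state == "s1":
            if a == "1":
                out.append(y)         # keep copying the block of 1's
            elif a == "0":
                out.append(y[:i - 1] + (GAMMA,) + y[i:])
                state = "s2"          # the end of the block is now marked
            else:
                return None
        else:                         # state s2: copy the rest of the input
            out.append(y)
    return out


if __name__ == "__main__":
    word = list(zip("01100", "00110"))     # pairs, i.e. k = 2
    marked = mark_end(word, 1)
    print("".join(y[0] for y in marked))
    print("".join(y[1] for y in marked))
```

Only the chosen coordinate changes, and only at one position, which is what lets the transducers $M_1, \ldots, M_k$ be chained one after another on the same row of tuples.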



Step 5 Now the shifting of the i-th input to the right of the $i - 1$ input is achieved with the function $\sigma_i$. This function corresponds to a transducer with the following transitions (all states have superscripts $i$ which are omitted to simplify the reading, hence we denote $y = (a_1, \ldots, a_k)$ and $y' = (a'_1, \ldots, a'_k)$):



$(s_0, y) \mapsto (y, s_0)$ for $a_i = 0$, $a_{i-1} \ne \gamma$
$(s_0, y) \mapsto (y, D)$ for $a_i = 0$, $a_{i-1} = \gamma$
$(s_0, y) \mapsto (y', s_1)$ for $a_i = 1$, $a_j = a'_j$, $a'_i = 0$ ($j \ne i$)
start the shift with $s_1$ unless there is a $\gamma$ in the $i - 1$ coordinate before there is 1 in the i-th coordinate, in that case go to the final state $D$

$(s_1, y) \mapsto (y, s_1)$ for $a_i = 1$
$(s_1, y) \mapsto (y', s_2)$ for $a_i = \gamma$, $a'_i = 1$, $a_j = a'_j$ ($j \ne i$)
copy the 1’s, change the end $\gamma$ into 1 and go to $s_2$

$(s_2, y) \mapsto (y', s_3)$ for $a_i = 0$ or $\alpha$, $a'_i = \gamma$
record the end symbol $\gamma$ and change to $s_3$, note that $a_i$ cannot be 1

$(s_3, y) \mapsto (y, s_3)$ for $a_i \ne \sigma_j$, $y \ne \lambda$
$(s_3, y) \mapsto (y', !_{y'})$ for $y = \lambda$, $a'_j = \sigma_i$ ($j = 1, \ldots, k$)
copy the input while there are input symbols, if no input is to be found indicate the end of input and go to the next shifting, expand space if necessary

if $i < k$ then
$(D, y) \mapsto (y, D)$
go to the end of input and change the shift to the next coordinate using the ending tile $\tau_D = [D, (\sigma_i, \ldots, \sigma_i), (\sigma_{i+1}, \ldots, \sigma_{i+1}), \beta_r]$

if $i = k$ then
$(D, y) \mapsto (y, D)$ with ending tile $\tau_D = [D, (\sigma_k, \ldots, \sigma_k), T_R, \beta_r]$
Since the shifting is performed one at a time, $\sigma_i$ potentially has to be repeated many times. This is obtained by setting the “end of input” of the form $(\sigma_i, \ldots, \sigma_i, \ldots, \sigma_i)$ which indicates that the next shift is the function $\sigma_i$. The terminal states $D$ and $s_3$ distinguish whether the shift has to be repeated (state $s_3$) or the shifting of the i-th coordinates is completed (state $D$). When $i = k$ the end tile at state $D$ changes such that $(\sigma_k, \ldots, \sigma_k)$ is substituted with $T_R$ which is the transducer that translates symbols from k-tuples back into symbols from $\Sigma$.

Step 6 The transition from k-tuples of symbols to a single symbol is done with the following simple transitions and only one state (note that after shifting, each k-tuple contains at most one 1):

$(q_0, y) \mapsto (0, q_0)$ if $y$ contains no 1’s
$(q_0, y) \mapsto (1, q_0)$ if $y$ contains a 1

The start tile for $T_R$ is $[\beta_l, T_R, f, q_0]$ indicating that the next computation has to be performed with tiles for function $f$ and the end tile for $T_R$ is $[q_0, T_R, \alpha, \beta_r]$ fixing the input for $f$ in the standard form. After the application of $T_R$, the input reads the unary representation of $(g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))$.
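The one-state translation of Step 6 amounts to a position-wise "contains a 1" test. The sketch below is illustrative; the function name `collapse` is hypothetical, and the explicit check that each tuple holds at most one 1 is an added safeguard reflecting the remark above rather than part of the transducer.

```python
def collapse(word):
    """Translate a word of k-tuples back to a word over {0, 1}:
    a tuple becomes 1 if it contains a 1 and 0 otherwise."""
    out = []
    for y in word:
        if sum(1 for a in y if a == "1") > 1:
            raise ValueError("after shifting, a tuple may hold at most one 1")
        out.append("1" if "1" in y else "0")
    return "".join(out)


if __name__ == "__main__":
    # Two coordinates whose blocks of 1's no longer overlap after shifting:
    # coordinate 1 carries 0110000 and coordinate 2 carries 0000110.
    word = list(zip("0110000", "0000110"))
    print(collapse(word))
```

In this example the result is the unary representation of the pair whose components were spread over the two coordinates, which is the standard input form expected by the tiles for $f$.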

Step 7 The composition is completed using tiles for /. These tiles exist by the 
inductive hypothesis. 

3.3 Recursion 

In order to perform the recursion, we observe that the composition and translation into n-tuples is sufficient. Denote with $\mathbf{x}$ the n-tuple $(x_1, \ldots, x_n)$. Then we have $h(\mathbf{x}, 0) = f(\mathbf{x})$ for some primitive recursive function $f$. By induction we assume that there is a set of tiles that performs the computation for the function $f$. For $h(\mathbf{x}, t + 1) = g(h(\mathbf{x}, t), \mathbf{x}, t)$, also by the inductive hypothesis we assume that there is a set of tiles that performs the computation of $g(y, \mathbf{x}, t)$.

Set up the input. We use the following input 01*“''^j^x010 where x denotes 
the unary representation of the n-tuple x. We denote this with (t -|- 1 |x, 0). 
The symbol v is used to separate the computational input from the counter of 
the recursion. Since this input is not standard, we show that we can get to this 
input from a standard input for h which is given with an ordered (n -I- l)-tuple 
(x,y). First we add a 0 unary represented with string 010 at the end of the 
input to obtain (x,y, 0). This is obtained with the transducer depicted 

in Figure 7. Note that for k = I the translation function Tk (Figure 6) is in 
fact the identity function, hence, there is a set of prototiles that represents the 
function Id(x, y, 0). Consider the composition ld{U^^ 2 i U^+ 2 ^ • ■ • , )• 

This composition transforms (x, y, 0) into (y, x, 0). We obtain the desired entry
for the recursion by marking the end of the first coordinate input (used as a
counter for the recursion) with the symbol ν. This is obtained by adding the following
transitions to Id(x, 0): (i0, 0) ↦ (0, i0), (i0, 1) ↦ (1, i1), (i1, 1) ↦ (1, i1) and
(i1, 0) ↦ (ν, s0), where i0, i1 are new states and s0 is the starting state for
Id(x, 0). With this, our entry is in the desired form of (y, x, 0) = (t+1 | x, 0).



Fig. 7. The transducer that increases the input from n-tuple x to (n+1)-tuple
(x, 0). The figure label reads: check for n-tuple input with 2n+1 states.
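As an illustration of the marking step, the transitions can be run on a sample word. This is a sketch under assumptions: the helper `run_transducer` and the table layout are hypothetical, and the identity transducer Id is modelled here as a single state s0 that copies its input.

```python
NU = "ν"

# (state, input symbol) -> (output symbol, next state)
MARK_TRANSITIONS = {
    ("i0", "0"): ("0", "i0"),
    ("i0", "1"): ("1", "i1"),
    ("i1", "1"): ("1", "i1"),
    ("i1", "0"): (NU, "s0"),   # the 0 that closes the first entry becomes ν
    ("s0", "0"): ("0", "s0"),  # assumed stand-in for Id: copy the rest unchanged
    ("s0", "1"): ("1", "s0"),
}


def run_transducer(transitions, start_state, word):
    """Feed word through the transducer and return the output word."""
    state, output = start_state, []
    for symbol in word:
        out_symbol, state = transitions[(state, symbol)]
        output.append(out_symbol)
    return "".join(output)


marked = run_transducer(MARK_TRANSITIONS, "i0", "0111011010")
```

Here `marked` is `"0111ν11010"`: the first block of 1's is left intact, the 0 that follows it is replaced by ν, and the remainder of the word is copied.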

Execution of the recursion. Each transducer T that is used for computation
of functions f and g is adjusted to T', which skips the input until the symbol ν is
reached. This is obtained with one additional state c and transitions
(c, i) ↦ (i, c) for i = 0, 1 and (c, ν) ↦ (ν, s0), where s0 is the starting state
for T. As in the case of
composition, we further adjust the prototiles for functions f and g into prototiles
P(f)' and P(g)' which read/write pairs of symbols, but the computation of f and
g is performed on the first coordinate, i.e., every prototile of the form [q, a, a', q']
in the set of prototiles for f and g is substituted with [q, (a, a2), (a', a2), q'] where
a2 ∈ {0, 1}. Second coordinates are kept to “remember” the input for f and g
that is used in the current computation and will be used in the next computation
as well.
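The substitution of prototiles by pair-symbol prototiles is mechanical and can be sketched as follows. The tuple representation (q, a, a', q') and the function name `lift_to_pairs` are assumptions for illustration only.

```python
def lift_to_pairs(prototiles, alphabet=("0", "1")):
    """Replace each prototile (q, a, a_out, q_next) by one tile per value of a2.

    The computation acts on the first coordinate; the second coordinate a2
    is carried through unchanged so that the original input is remembered.
    """
    lifted = []
    for q, a, a_out, q_next in prototiles:
        for a2 in alphabet:
            lifted.append((q, (a, a2), (a_out, a2), q_next))
    return lifted


# A single transition that rewrites 0 to 1 while staying in state q
lifted = lift_to_pairs([("q", "0", "1", "q")])
```

Each original prototile yields two lifted prototiles over the binary alphabet, so the lifted set is twice as large as the original one.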

In this case the recursion is obtained by the following procedure:

– Translate input (t+1 | x, 0) into (t+1 | x, 0) using the translation transducer
T_2 as presented in Figure 6 and adjusted with the additional state to skip
the initial t+1. Now each symbol a in the input portion (x, 0) is translated
into a pair (a, a).

– Apply P(f)'; hence the result f(x) can be read from the first coordinates of
the input symbols and the input (x, 0) is read in the second coordinates.

– Mark with M_1' the end of input from the first coordinate. This is the same
transducer M_1 as used in the composition, except in this case there is an
extra state that skips the counter t+1 at the beginning.

– Shift coordinates using σ_2', i.e., the same transducer as σ_2 for the composition
with the additional state to skip the counter.

– Translate back from pairs into {0, 1} with T_R'. Now the result reads as
(t+1 | f(x), x, 0), i.e., h(x, 0) = f(x) is read right after the symbol ν.

– For an input (t+1 | f(x), x, 0) reduce t+1 by one. This is done with the
transducer presented in Figure 8 (a). The new input reads (t | f(x), x, 0).

– Check for end of computation with a transducer that accepts 0···0νw for any
word w. Note that the language 0*ν(0+1)* is regular and so is accepted by
a finite state transducer in which each transition has its output symbol equal
to its input symbol.
Fig. 8. The transducer that reduces the count of t and adds 1 to the input for
the recursion. When the countdown is reduced to 0, the recursion has been applied
t+1 times.



– For j = t, ..., 0 perform the recursion on input (j | y_j, x, t−j):

• Translate with T_2, leaving the first entry unchanged and making all entries
2, ..., n+2 into pairs.

• Perform P(g)' on the first coordinate of these entries (skipping the first
entry j). In other words, g is applied on the entry (y_j, x, t−j), and by
the inductive hypothesis y_j = h(x, t−j).

• Mark the end of first coordinate input in the pairs of symbols with M_1'.

• Shift with σ_2'.

• Translate back from pairs into single symbols with T_R'. Now the result
reads as (j | g(y_j, x, t−j), x, t−j). By definition, g(y_j, x, t−j) =
g(h(x, t−j), x, t−j) is in fact h(x, t−j+1). Denote y_{j−1} = g(y_j, x, t−j).

• For an input (j | y_{j−1}, x, t−j) reduce j by one and increase the last entry
by 1. This is done with the transducer presented in Figure 8 (b). The new
input now reads (j−1 | y_{j−1}, x, t−j+1).

• Check for end of the loop with a transducer that accepts 0···0νw for any
word w.

– The end of the loop reads (0 | y_0, x, t) and by construction
h(x, t+1) = y_0. Now change all symbols before and after y_0 into 0 (note that
y_0 appears right after ν). This can be obtained with the following transitions:

(s0, 0) ↦ (0, s0)   (s0, ν) ↦ (0, s1)   (s1, 0) ↦ (0, s1)   (s1, 1) ↦ (1, s2)
(s2, 1) ↦ (1, s2)   (s2, 0) ↦ (0, s3)   (s3, 0) ↦ (0, s3)   (s3, 1) ↦ (0, s3)

– Read out the result with upper border.
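The clean-up transitions of the last step can be checked on a small word. The sketch below assumes a dictionary encoding of the transition table and a hypothetical helper `apply_cleanup`; the sample input is only an example of a word of the form 0···0νw.

```python
NU = "ν"

# (state, input symbol) -> (output symbol, next state)
CLEANUP = {
    ("s0", "0"): ("0", "s0"),
    ("s0", NU): ("0", "s1"),   # the separator is erased
    ("s1", "0"): ("0", "s1"),
    ("s1", "1"): ("1", "s2"),  # first 1 of the result block
    ("s2", "1"): ("1", "s2"),  # copy the result block
    ("s2", "0"): ("0", "s3"),
    ("s3", "0"): ("0", "s3"),
    ("s3", "1"): ("0", "s3"),  # everything after the result is zeroed
}


def apply_cleanup(word):
    """Run the clean-up transducer from state s0 and return the output word."""
    state, output = "s0", []
    for symbol in word:
        out_symbol, state = CLEANUP[(state, symbol)]
        output.append(out_symbol)
    return "".join(output)


cleaned = apply_cleanup("00ν0111011010")
```

For this input, `cleaned` is `"0000111000000"`: only the block of 1's that follows ν survives, and every other symbol is rewritten to 0.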
4 Recursive Functions as Composition of Transducers

Recursive functions are defined through the so-called minimization operator.
Given a function g(x_1, ..., x_n, y) = g(x, y), the minimization f(x) = μy g(x, y) is
defined as:

f(x) = y          if y is the least natural number such that g(x, y) = 0

f(x) = undefined  if g(x, y) ≠ 0 for all y

A function f(x) is called total if it is defined for all x. The class of recursive
or computable functions is the class of total functions that can be obtained
from the initial functions z(x), s(x), U^n_i(x) with a finite number of applications
of composition, recursion and minimization. The Kleene normal form theorem
shows that every recursive function can be obtained with only one application
of minimization (see for example [5]).
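A direct reading of the minimization operator is an unbounded search for the least y with g(x, y) = 0. The sketch below adds a search budget, which is an assumption made only so that the example always terminates; the names `minimize` and `max_steps` are hypothetical.

```python
def minimize(g, max_steps=10_000):
    """Return f with f(x) = least y such that g(x, y) == 0.

    True minimization may run forever when no such y exists; the budget
    max_steps is an assumption of this sketch, and None signals that the
    search was abandoned (the "undefined" case cannot be detected in general).
    """
    def f(*x):
        for y in range(max_steps):
            if g(*x, y) == 0:
                return y
        return None
    return f


# Least y with y*y >= x, i.e. the ceiling of the square root of x
ceil_sqrt = minimize(lambda x, y: 0 if y * y >= x else 1)
```

For example, `ceil_sqrt(10)` is 4 and `ceil_sqrt(0)` is 0. The transducer construction performs the same search, with the candidate y stored in the second coordinates.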

In order to show that all recursive (computable) functions can be obtained
as a composition of transducers, we are left to observe that the minimization
operation can be obtained in this same way. As in the case of recursion, we
adjust the prototiles for function g into prototiles P(g) which contain pairs as
symbols, but the computation of g is performed on the first coordinate, i.e.,
every prototile of the form [q, a, a', q'] in the set of prototiles for g is substituted
with [q, (a, a2), (a', a2), q'] where a2 ∈ {0, 1}. Second coordinates are kept to
“remember” the input for g. Now the minimization is obtained in the following
way (all representations of x are in unary).

– Fix the input x into (x, 0) by the transducer depicted in Figure 7.

– Translate input (x, 0) into (x, 0) using the translation transducer T_2 as
presented in Figure 6. Now each symbol a in the input portion (x, 0) is translated
into a pair (a, a).

– Apply P(g); hence the result g(x) can be read from the first coordinates of
the input symbols. The second coordinates read (x, 0).

– Until the first coordinate reads 0 continue:

• (*) Check whether g(x) = 0 in the first coordinate. This can be done
with a transducer which “reads” only the first coordinate of the input
pair, accepts strings of the form 0⁺10⁺, and rejects all other strings.

• If the transducer is in “accept” form, then

* Change 1 from the first coordinate in the input into 0.

* Translate back from pairs of symbols into a single symbol with T_R.

* Apply stop.

• If the transducer is in “reject” form,

* Change all 1's from the first coordinate into 0's (i.e., apply z(x) on
the first coordinate).

* Apply the successor function to the last entry of the second coordinate.
Hence an input that reads (x, y) from the second coordinate is changed
into (x, y+1).

* Copy the second coordinates to the first with transitions
(s, (a1, a2)) ↦ ((a2, a2), s).

* Apply P(g), i.e., compute g(x, y+1) on the first coordinate, and go back
to (*).



5 DNA Implementation 

5.1 The Basic Transducer Model 

Winfree et al. [24] have shown that 2D arrays made of DX molecules can simulate
the dynamics of one-dimensional cellular automata (CA). This provides a way
to simulate a Turing machine, since at one time step the configuration of a one- 
dimensional CA can represent one instance of the Turing machine. In the case 
of tiling representations of finite state machines (transducers) each tile repre- 
sents a transition, such that the result of the computation with a transducer is 
obtained within one row of tiles. Transducers can be seen as Turing machines 
without left movement. Hence, the Wang tile simulation gives the result of a 
computation with such a machine within one row of tiles, which is not the case 
in the simulation of CA. A single tile simulating a transducer transition requires 
more information and DX molecules are not suitable for this simulation. Here 
we propose to use tiles made of triple crossover (TX) DNA molecules with sticky 
ends corresponding to one side of the Wang tile. An example of such a molecule is
presented in Figure 9 (b) with 3' ends indicated with arrowheads. The connector
is a sticky part that does not contain any information but it is necessary for
the correct assembly of the tiles (see below). The TX tiles representing transitions
for a transducer have the middle duplex longer than the other two (see the
schematic presentation in Figure 13). It has been shown that such TX molecules 
can self-assemble in an array [13], and they have been linearly assembled such 
that a cumulative XOR operation has been executed [16]. However, we have yet 
to demonstrate that they can be assembled reliably in a programmable way in a 
two dimensional array. Progress towards using double crossover molecules (DX) 
for assembly of a 2D array that generates the Sierpinski triangle was recently 
reported by the Winfree lab [18]. 




Fig. 9. A computational tile for a FSM and a proposed computational TX 
molecule tile. 



Controlling the right assembly can be done by two approaches: (1) by regulating
the temperature of the assembly such that only tiles that can bind three
sticky ends at a time hybridize, and those with fewer than three don’t, or (2) by
including competitive imperfect hairpins that will extend to proper sticky ends
only when all three sites are paired properly.
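The first approach amounts to a binding threshold: a tile attaches only when all three of its sticky ends find partners. The following abstract model is only a sketch; the labels, the tuple layout (state, symbol, connector) and the function names are assumptions, and equal labels stand in for complementary DNA sequences.

```python
def matching_ends(tile_ends, site_ends):
    """Count positions where the tile's sticky end pairs with the site's end."""
    return sum(1 for t, s in zip(tile_ends, site_ends) if t == s)


def can_attach(tile_ends, site_ends, threshold=3):
    """A tile hybridizes only if at least `threshold` sticky ends pair."""
    return matching_ends(tile_ends, site_ends) >= threshold


# Hypothetical growth site: incoming state, input symbol, connector
site = ("q0", "1", "connector")
tiles = {
    "matching tile": ("q0", "1", "connector"),
    "wrong state": ("q1", "1", "connector"),
    "wrong symbol": ("q0", "0", "connector"),
}
attached = [name for name, ends in tiles.items() if can_attach(ends, site)]
```

In this toy model only the tile that matches on all three ends is accepted; lowering `threshold` to 2 would admit the two mismatched tiles as well, which is the kind of error the temperature control is meant to suppress.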

An iterated computation of the machine can be obtained by allowing third, 
fourth etc. rows of assembly and hence primitive recursive functions can be 
obtained. The input for this task is a combination of DX and TX molecules as 
presented in Figure 10. The top TX duplex (not connected to the neighboring 
DX) will have the right end sticky part encoding one of the input symbols and 
the left sticky end will be used as connector. The left (right) boundary of the 
assembly is obtained with TX molecules that have the left (right) sides of their 
duplexes ending with hairpins instead of sticky ends. The top boundary contains 
different motifs (such as the half hexagon in Figure 11 (b)) for different symbols. 
For a two symbol alphabet, the output tile for one symbol may contain a motif 
that acts as a topographic marker, and the other not. In this way the output 
can be detectable by atomic force microscopy. 




Fig. 10. Input for the computational assembly. 



5.2 Programmable Computations with DNA Devices

Recent developments in DNA nanotechnology enable us to produce FSMs with
variable and potentially programmable inputs. The first of these developments is
the sequence-dependent PX-JX₂ two-state nanomechanical device [26]. This robust
device, whose machine cycle is shown in Figure 11 to the left, is directed by the
addition of set strands to the solution that forms its environment.

The set strands, drawn thick and thin, establish which of the two states the 
device will assume. They differ by rotation of a half-turn in the bottom parts 
of their structures. The sequence-driven nature of the device means that many 
different devices can be constructed, each of which is individually addressable; 
this is done by changing the sequences of the strands connecting the DX portion 
AB with the DX portion CD where the thick or thin strands pair with them. 
The thick and thin strands have short, unpaired extensions on them. The state 
of the device is changed by binding the full biotin-tailed complements of thick 
or thin strands, removing them from solution by magnetic streptavidin beads, 
and then adding the other strands. Figure 11 to the right shows that the device 
can change the orientation of large DNA trapezoids, as revealed by atomic force 
microscopy. The PX (thick strand) state leads to parallel trapezoids and the JX₂
(thin strand) state leads to a zig-zag pattern.

Fig. 11.



Linear arrays of a series of PX-JX₂ devices can be adapted to set the input
of a FSM. This is presented in Figure 12, where we have replaced the trapezoids
with double-triangle double diamonds. The computational set-up is illustrated
schematically in Figure 13 with just two computational tiles made of DNA triple
crossover (TX) molecules.




Fig. 12. Linear array of a series of devices to set-up the input. 



Both diamonds and trapezoids are the result of edge-sharing motifs [27]. A 
key difference between this structure and the previous one is that the devices 
between the two illustrated double diamond structures differ from each other. 
Consequently, we can set the states of the two devices independently. The double- 
diamond structures contain domains that can bind to DX molecules; this design 
will set up the linear array of the input, see Figure 12. Depending on the state 
of the devices, one or the other of the double diamonds will be on the top 
side. The two DX molecules will then act similarly to bind with computational 
TX molecules, in the gap between them, as shown schematically in Figure 13. 
The system is designed to be asymmetric by allowing an extra diamond for the
“start” and the “end” tiles (this is necessary in order to distinguish the top and
the bottom in atomic force microscopy (AFM) visualizations). The left side
must contain an initiator (shown blunt at the top) and the right side will contain
a terminator. The bottom assembly can be removed using the same Yurke-type
techniques [28] that are used to control the state of the devices for removal of
the set strands. Successive layers can be added to the device by binding to the
initiator on the left, and to the terminator on the right.



Fig. 13. Schematic view for a couple of first step computations. The linear array
of DX and TX molecules that sets up the input is substituted only with a
rectangle for each of the input symbols.



Fig. 14.

For this task, we have prototyped the assembly of the central 4-diamond
patterns, connecting them with sticky ends, rather than PX-JX₂ devices. AFM
images of these strings are shown on the left in Figure 14. We have also built a
5-diamond system connected by devices. Its AFM pattern is shown on the right
in Figure 14.
6 Discussion

What is the feasibility of this proposed system for a finite state machine? The 
TX tiles have already been developed and have been successfully self-assembled 
into a periodic 2D array [13]. Thus, we feel it is a reasonable extension of this 
previous work to extend this system to prototiles in a Wang tile system. Likewise, 
we have developed and purified a short double diamond-like system of connected
PX-JX₂ devices. The experimental problems that we envision are the following:
(i) Purifying a long linear system so that we do not have a mixture of input
components. We expect that this can be done as with the short system,
using specifically biotinylated molecules on the end members of the input array.
If the system is to be run similarly to the PX-JX₂ device previously constructed,
we will need to remove the biotin groups from the system before operation. (ii)
A second problem that we envision is that three sticky ends need to bind, but 
two sticky ends must be rejected. We have noted above two different methods 
by which this goal might be achieved. The most obvious form of competition 
is to make all of the sticky ends imperfect hairpins that require perfect mates 
to come undone. Experimentation will be required to identify the most effective 
protocol for this purpose. 

An iterated computation (or a composition) of the machine can be obtained 
by allowing third, fourth etc. rows of assembly and hence recursive functions 
can be obtained. Although theoretically this has already been observed, the pro- 
posed way to simulate a transducer provides a much faster computational device 
than simulating a CA. As shown by examples in Figures 1 and 2, functions like
addition and division as well as encoding devices can be simulated with a sin- 
gle row of tiles. However, the theoretical proofs for composition and recursion 
use transducers that perform a single rewriting step and are not much different 
from the computational steps simulated by CA. It is a theoretical question to 
characterize the functions that can be computed by a single application of a 
transducer. Knowing these functions, we can get a better understanding of the 
computational complexity of their iterations and compositions and with that 
a better understanding of the practical power of the proposed method. From
another point of view, it is clear that errors can be minimized with a minimal
number of prototiles. It is known that the iterations of a four-state gsm provide
universal computation [15], but the minimal number of transitions
in a transducer whose iterations provide universal computation remains to be
determined.

With this paper we have presented ideas on how to obtain a finite state
nanomachine that is programmable, produces an output and is potentially reusable.
Success in this objective relies on contributions of multiple disciplines, includ- 
ing biochemistry and information theory. Chemistry and biochemistry provide 
the infrastructure for dealing with biomolecules such as DNA and RNA. The 
development of programmable nano-devices provides us with a tool for building 
new (bio)molecules. Information theory provides the algorithms as well as the 
understanding of the potential and limitations of these devices as information 
processing machines. In turn, we expect that information technology will gain 
improved molecular based methods that ultimately will prove to be of value for
difficult applications and computational problems.
Abstract. The set of all sequences that are generated by a biomolecular
protocol forms a language over the four letter alphabet Δ = {A, G, C, T}.
This alphabet is associated with a natural involution mapping θ, A ↦ T
and G ↦ C, which is an antimorphism of Δ*. In order to avoid undesirable
Watson-Crick bonds between the words (undesirable hybridization),
the language has to satisfy certain coding properties. In this paper we
build upon an earlier initiated study and give general methods for obtaining
sets of code words with the same properties. We show that some
of these code words have enough entropy to encode {0, 1}* in a
symbol-to-symbol mapping.



1 Introduction 

In bio-molecular computing, and in particular DNA based computations and
DNA nanotechnology, one of the main problems is associated with the design
of the oligonucleotides such that mismatched pairing due to the Watson-Crick
complementarity is minimized. In laboratory experiments non-specific hybridizations
pose potential problems for the results of the experiment. Many authors
have addressed this problem and proposed various solutions. A common approach
has been to use the Hamming distance as a measure for uniqueness [3,8,9,11,19].
Deaton et al. [8,11] used genetic algorithms to generate a set of DNA sequences
that satisfy a predetermined Hamming distance. Marathe et al. [20] also used
Hamming distance to analyze combinatorial properties of DNA sequences, and
they used dynamic programming for design of the strands used in [19]. Seeman’s
program [23] generates sequences by testing overlapping subsequences to enforce
uniqueness. This program is designed for producing sequences that are suitable
for complex three-dimensional DNA structures, and the generation of suitable
sequences is not as automatic as in the other proposed programs. Feldkamp
et al. [10] also use the test for uniqueness of subsequences and rely on tree
structures in generating new sequences. Ruben et al. [22] use a random generator
for initial sequence design, and afterwards check for unique subsequences
with predetermined properties based on Hamming distance. One of the first
theoretical observations about the number of DNA code words satisfying minimal
Hamming distance properties was made by Baum [3]. Experimental separation
of strands with “good” codes that avoid intermolecular cross hybridization was
reported in [7]. One of the more significant attempts to design a method for
generating good codes of uniform length was made by Arita and Kobayashi [2].
They design a single template DNA molecule, encode it in binary, and by taking a
product with an already known good binary code obtain a sequence of codes that
satisfy certain minimal distance conditions. In this paper we do not consider the
Hamming distance.

Every biomolecular protocol involving DNA or RNA generates molecules
whose sequences of nucleotides form a language over the four letter alphabet
Δ = {A, G, C, T}. The Watson-Crick (WC) complementarity of the nucleotides
defines a natural involution mapping θ, A ↦ T and G ↦ C, which is an
antimorphism of Δ*. Undesirable WC bonds (undesirable hybridizations) can be
avoided if the language satisfies certain coding properties. Tom Head considered
comma-free levels of a language by identifying the maximal comma-free sublanguage
of a given L ⊆ Δ*. This notion is closely related to the ones investigated
in this paper by taking θ to be the identity map I. In [13], the authors introduce
the theoretical approach followed here. Based on these ideas and code-theoretic
properties, a computer program for generating code words is being developed [15]
and additional properties were investigated in [14]. Another algorithm, based on
backtracking, for generating such code words was also developed by Li [18]. In
particular for DNA code words it was considered that no involution of a word
is a subword of another word, or no involution of a word is a subword of a
composition of two words. These properties are called θ-compliance and θ-freedom
respectively. The case when a DNA strand may form a hairpin (i.e., when a
word contains a reverse complement of a subword) was introduced in [15] and
was called θ-subword compliance.

We start the paper with definitions of coding properties that avoid intermolecular
and intramolecular cross hybridizations. The definitions of θ-compliant
and θ-free languages are the same as the ones introduced in [13]. Here we also
consider intramolecular hybridizations and subword hybridizations. Hence, we have
two additional coding properties: θ-subword compliance and θ-k-code. We make
several observations about the closure properties of the code word languages.
Section 3 provides several general methods for constructing code words with
certain desired properties. For each of the resulting sets we compute the
informational entropy, showing that they can be used to encode {0, 1}*. We end with
a few concluding remarks.



2 Definitions and Closure Properties 

An alphabet Σ is a finite non-empty set of symbols. We will denote by Δ the
special case when the alphabet is {A, G, C, T} representing the DNA nucleotides.
A word u over Σ is a finite sequence of symbols in Σ. We denote by Σ* the set
of all words over Σ, including the empty word 1, and by Σ⁺ the set of all
non-empty words over Σ. We note that with concatenation, Σ* is the free monoid
and Σ⁺ is the free semigroup generated by Σ. The length of a word u = a_1···a_n
is n and is denoted with |u|.
For words representing DNA sequences we use the following convention. A
word u over Δ denotes a DNA strand in its 5' → 3' orientation. The Watson-Crick
complement of the word u, also in orientation 5' → 3', is denoted with
ū. For example, if u = AGGC then ū = GCCT. There are two types of unwanted
hybridizations: intramolecular and intermolecular. The intramolecular
hybridization happens when two sequences, one being a reverse complement of
the other, appear within the same DNA strand (see Fig. 1). In this case the DNA
strand forms a hairpin.
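The convention can be made concrete with a short sketch that computes the Watson-Crick complement in 5' → 3' orientation and looks for a hairpin-forming pattern. The function names and the brute-force search are assumptions for illustration only.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def wc_complement(u):
    """Watson-Crick complement of u, written 5' -> 3' (reverse, then complement)."""
    return "".join(COMPLEMENT[base] for base in reversed(u))


def has_hairpin(strand, k, m_min, m_max):
    """True if strand contains u, then m bases, then wc_complement(u),
    for some u of length k and some m with m_min <= m <= m_max."""
    for i in range(len(strand) - k + 1):
        u = strand[i:i + k]
        for m in range(m_min, m_max + 1):
            j = i + k + m
            # a slice that runs past the end is shorter than k and cannot match
            if strand[j:j + k] == wc_complement(u):
                return True
    return False
```

For the example in the text, `wc_complement("AGGC")` returns `"GCCT"`. A strand such as `"AGGCTTTTGCCT"` is reported as hairpin-forming for k = 4 and a loop of 4 bases.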



[Figure 1, panels (a) and (b): a strand w = u v θ(u) x forming a hairpin, with |u| = k and |v| = m.]

Fig. 1. Intramolecular hybridization (θ-subword compliance): (a) the reverse
complement is at the beginning of the 5' end, (b) the reverse complement is at the
end of the 3' end. The 3' end of the DNA strand is indicated with an arrow.



Two particular intermolecular hybridizations are of interest (see Fig. 2). In
Fig. 2 (a) the strand labeled u is a reverse complement of a subsequence of the
strand labeled v, and in the same figure (b) represents the case when u is the
reverse complement of a portion of a concatenation of v and w.

Fig. 2. Two types of intermolecular hybridization: (a) (θ-compliant) one code word is
a reverse complement of a subword of another code word, (b) (θ-free) a code word is a
reverse complement of a subword of a concatenation of two other code words. The 3'
end is indicated with an arrow.



Throughout the rest of the paper, we concentrate on finite sets X ⊆ Σ* that
are codes such that every word in X⁺ can be written uniquely as a product
of words in X. In other words, X* is a free monoid generated by X. For
the background on codes we refer the reader to [5]. We will need the following
definitions:

Pref(w) = {u | ∃v ∈ Σ*, uv = w}

Suff(w) = {u | ∃v ∈ Σ*, vu = w}

Sub(w) = {u | ∃v1, v2 ∈ Σ*, v1 u v2 = w}

The sets of prefixes, suffixes and subwords of a set of words are defined in the
same way. Similarly, we have Suff_k(w) = Suff(w) ∩ Σ^k, Pref_k(w) = Pref(w) ∩ Σ^k
and Sub_k(w) = Sub(w) ∩ Σ^k.
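These sets are easy to enumerate for short words. The sketch below uses Python strings for words and the empty string for the empty word 1; the function names are chosen for this example.

```python
def pref(w):
    """All prefixes of w, including the empty word and w itself."""
    return {w[:i] for i in range(len(w) + 1)}


def suff(w):
    """All suffixes of w, including the empty word and w itself."""
    return {w[i:] for i in range(len(w) + 1)}


def sub(w):
    """All subwords (contiguous factors) of w."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def sub_k(words, k):
    """Sub_k of a set of words: all subwords of length exactly k."""
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}
```

For instance, `pref("AGC")` is `{"", "A", "AG", "AGC"}` and `sub_k({"AGGC"}, 2)` is `{"AG", "GG", "GC"}`.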
We follow the definitions initiated in [13] and used in [15,16].

An involution θ : Σ → Σ of a set Σ is a mapping such that θ² equals the
identity mapping, θ(θ(x)) = x for all x ∈ Σ.

The mapping ν : Δ → Δ defined by ν(A) = T, ν(T) = A, ν(C) = G,
ν(G) = C is an involution on Δ and can be extended to a morphic involution of
Δ*. Since the Watson-Crick complementarity appears in a reverse orientation,
we consider another involution ρ : Δ* → Δ* defined inductively, ρ(s) = s for
s ∈ Δ and ρ(us) = ρ(s)ρ(u) = sρ(u) for all s ∈ Δ and u ∈ Δ*. This involution is
an antimorphism such that ρ(uv) = ρ(v)ρ(u). The Watson-Crick complementarity
then is the antimorphic involution obtained with the composition νρ = ρν. Hence
for a DNA strand u we have that ρν(u) = νρ(u) = ū. The involution ρ reverses
the order of the letters in a word and as such is used in the rest of the paper.

For the general case, we concentrate on morphic and antimorphic involutions
of Σ* that we denote with θ. The notions of θ-free and θ-compliant in 2, 3 of
Definition 21 below were initially introduced in [13]. Various other intermolecular
possibilities for cross hybridizations were considered in [16] (see Fig. 3). All of
these properties are included with θ-k-code introduced in [14] (4 of Definition
21).

Definition 21 Let θ : Σ* → Σ* be a morphic or antimorphic involution. Let
X ⊆ Σ* be a finite set.

1. The set X is called θ(k, m1, m2)-subword compliant if for all u ∈ Σ^k
we have Σ*uΣ^mθ(u)Σ* ∩ X = ∅ for m1 ≤ m ≤ m2.

2. We say that X is θ-compliant if Σ*θ(X)Σ⁺ ∩ X = ∅ and Σ⁺θ(X)Σ* ∩
X = ∅.

3. The set X is called θ-free if X² ∩ Σ⁺θ(X)Σ⁺ = ∅.

4. The set X is called θ-k-code for some k > 0 if Sub_k(X) ∩ Sub_k(θ(X)) = ∅.

5. The set X is called strictly θ if X' ∩ θ(X') = ∅ where X' = X \ {1}.

The notions of prefix, suffix (subword) compliance can be defined naturally 
from the notions described above, but since this paper does not investigate these 
properties separately, we don’t list the formal definitions here. 

We have the following observations: 

Observation 22 In the following we assume that k ≤ min{|x| : x ∈ X}.

1. X is strictly θ-compliant iff Σ*θ(X)Σ* ∩ X = ∅.

2. If X is strictly θ-free then X and θ(X) are strictly θ-compliant and θ(X) is
θ-free.

3. If X is θ-k-code, then X is θ-k'-code for all k' > k.

4. X is a θ-k-code iff θ(X) is a θ-k-code.

5. If X is strictly θ such that X² is θ(k, 1, m)-subword compliant, then X is
strictly θ-k-code.

6. If X is a θ-k-code then both X and θ(X) are θ(k, 1, m)-subword compliant,
θ(k, 1, m) prefix and suffix compliant for any m ≥ 1, and θ-compliant. If k ≤ |x|/2
for all x ∈ X then X is θ-free and hence avoids the cross hybridizations as
shown in Fig. 1 and 2.
7. If X is a θ-k-code then X and θ(X) avoid all cross hybridizations of length
k shown in Fig. 3 and so all cross hybridizations presented in Fig. 2 of [16].

[Figure 3: hybridized strands sharing a subword u with |u| = k.]

Fig. 3. Various cross hybridizations of molecules one of which contains subword of
length k and the other its complement.

It is clear that θ-subword compliance implies θ-prefix and θ-suffix compliance.
We note that when θ = ρν, the θ(k, m1, m2)-subword compliance of the code
words X ⊆ Δ* does not allow intramolecular hybridization as in Fig. 1 for a
predetermined k and m1 ≤ m ≤ m2. The maximal length of a word that together
with its reverse complement can appear as subwords of code words is limited
by k. The length of the hairpin, i.e., the “distance” between the word and its
reversed complement, is bounded between m1 and m2. The values of k and m1, m2
would depend on the laboratory conditions (e.g., the melting temperature, the
length of the code words and the particular application). In order to avoid
intermolecular hybridizations as presented in Fig. 2, X has to satisfy θ-compliance
and θ-freedom. Most applications would require X to be strictly θ. The most
restricted and valuable properties are obtained with θ-k-code, and the analysis
of this type of codes also seems to be more difficult. When X is θ-k-code, all
intermolecular hybridizations presented in Fig. 3 are avoided. We include several
observations in the next section.
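For finite sets of DNA words, the intermolecular properties can be tested by brute force. The sketch below takes θ to be the Watson-Crick involution and assumes all code words are non-empty; the function names are hypothetical and the checks follow the set-theoretic definitions directly rather than any optimized algorithm.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(word):
    """Watson-Crick involution: reverse the word and complement each letter."""
    return "".join(COMPLEMENT[c] for c in reversed(word))


def sub_k(words, k):
    """All subwords of length k of the words in the given set."""
    return {w[i:i + k] for w in words for i in range(len(w) - k + 1)}


def is_theta_compliant(X):
    """No theta(x) occurs as a proper subword of a word y in X."""
    return not any(theta(x) in y and len(x) < len(y) for x in X for y in X)


def is_theta_free(X):
    """No theta(x) occurs inside yz with a non-empty overhang on both sides."""
    return not any(theta(x) in (y + z)[1:-1] for x in X for y in X for z in X)


def is_theta_k_code(X, k):
    """Sub_k(X) and Sub_k(theta(X)) are disjoint."""
    return sub_k(X, k).isdisjoint(sub_k({theta(x) for x in X}, k))
```

As a usage example, the set {AAC, ACC} passes all three checks with k = 2, the set {AC, GGTT} is not θ-compliant because θ(AC) = GT occurs inside GGTT, and {AT, CA, TG} is not θ-free because θ(AT) = AT occurs in the middle of CA·TG.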

2.1 Closure Properties 

In this part of the paper we consider several closure properties of languages that
satisfy some of the properties described in Definition 21. We concentrate
on closure properties of union and concatenation of “good” code words. From
a practical point of view, we would like to know under what conditions two sets of
available “good” code words can be joined (union) such that the resulting set is
still a “good” set of code words. Also, whenever ligation of strands is involved,
we need to consider concatenation of code words. In this case it is useful to
know under what conditions the “good” properties of the code words will be
preserved after arbitrary ligations. The following table shows closure properties
of these languages under various operations.
Table 2.1 Closure Properties

                            θ-subword compl.   θ-compl.   θ-free   θ-k-code
Union                       Yes                No         No       No
Intersection                Yes                Yes        Yes      Yes
Complement                  No                 No         No       No
Concatenation (XY, X ≠ Y)   No                 Y/N        No       No
Kleene *                    No                 No         Yes      No


Most of the properties included in the table are straightforward. We add a
couple of notes for clarification.

– θ(k, m_1, m_2)-subword compliant languages are not closed under concatenation
and Kleene *. For example, consider the set $X = \{aba^2\} \subseteq \{a,b\}^*$ with
the morphic involution $\theta : a \mapsto b,\ b \mapsto a$. Then X is θ(2, 2, 4)-subword
compliant. But $X^2 = \{aba^3ba^2\}$ is not θ(2, 2, 4)-subword compliant, i.e., $X^2$
contains $(ab)a^3\theta(ab)a$.

– When θ is a morphism, θ-compliant languages are closed under concatenation,
but when θ is an antimorphism, they may not be. Consider the following
example: $X_1 = \{a^2, ab\}$ and $X_2 = \{b^2, aba\}$ with the antimorphic involution
$\theta : a \mapsto b,\ b \mapsto a$. Then $ab^3 \in X_1X_2$ and $bab^3 \in \theta(X_1X_2)$.
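
The two counterexamples in these notes can be checked mechanically. The following Python sketch is only illustrative: the helper names are ours, and the tests are written from the definitions as we read them (θ(k, m_1, m_2)-subword compliance forbids a subword u v θ(u) with |u| = k and m_1 ≤ |v| ≤ m_2; θ-compliance forbids θ(w) occurring as a proper subword of a code word). The exponents in the example sets are assumed to be X = {aba^2}, X_1 = {a^2, ab} and X_2 = {b^2, aba}.

```python
SWAP = {"a": "b", "b": "a"}


def theta_morphic(w):
    """Morphic involution a <-> b, applied letter by letter."""
    return "".join(SWAP[ch] for ch in w)


def theta_antimorphic(w):
    """Antimorphic involution a <-> b: swap the letters and reverse the word."""
    return theta_morphic(w)[::-1]


def subword_compliant(X, theta, k, m1, m2):
    """True if no x in X contains u v theta(u) with |u| = k and m1 <= |v| <= m2."""
    for x in X:
        for i in range(len(x) - k + 1):
            target = theta(x[i:i + k])
            for m in range(m1, m2 + 1):
                j = i + k + m
                if j + k <= len(x) and x[j:j + k] == target:
                    return False
    return True


def compliant(X, theta):
    """True if theta(w) never occurs as a proper subword of a word of X."""
    for w in X:
        image = theta(w)
        for z in X:
            if image in z and image != z:
                return False
    return True


def concat(X, Y):
    """Concatenation XY of two finite languages."""
    return {x + y for x in X for y in Y}


X = {"abaa"}
X1, X2 = {"aa", "ab"}, {"bb", "aba"}

print(subword_compliant(X, theta_morphic, 2, 2, 4))             # X itself
print(subword_compliant(concat(X, X), theta_morphic, 2, 2, 4))  # X^2
print(compliant(X1, theta_antimorphic), compliant(X2, theta_antimorphic))
print(compliant(concat(X1, X2), theta_antimorphic))             # X1 X2
```

Brute force of this kind only works for finite sets of short words, but it is a convenient sanity check before attempting a general argument.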

The next proposition, which is a stronger version of Proposition 10 in [13], shows
that for an antimorphic θ, the concatenation of two distinct θ-compliant or θ-free
languages is θ-compliant or θ-free whenever their union is θ-compliant or θ-free,
respectively. The proof is not difficult and is omitted.

Proposition 23 Let $X, Y \subseteq \Sigma^*$ be two θ-compliant (θ-free) languages for a
morphic or antimorphic θ. If $X \cup Y$ is θ-compliant (θ-free) then XY is
θ-compliant (θ-free).

3 Methods to Generate Good Code Words 

In the previous section closure properties of “good” code words were obtained. 
With the constructions in this section we show several ways to generate such 
codes. Many authors have realized that in the design of DNA strands it is helpful 
to consider three out of the four bases. This was the case with several successful 
experiments [4,9,19]. It turns out that this, or a variation of this technique can 
be generalized such that codes with some of the desired properties can be easily 
constructed. In this section we concentrate on providing methods to generate 
“good” code words X and for each code X the entropy of X* is computed. The 
entropy measures the information capacity of the codes, i.e., the efficiency of 
these codes when used to represent information. 

The standard definition of entropy of a code $X \subseteq \Sigma^*$ uses a probability
distribution over the symbols of the alphabet of X (see [5]). However, for a p-symbol
alphabet, the maximal entropy is obtained when each symbol appears with the
same probability $1/p$. In this case the entropy essentially counts the average
number of words of a given length as subwords of the code words [17]. From the
coding theorem, it follows that {0, 1}* can be encoded by X* with E {0, 1} if 
the entropy of X* is at least log 2 ([1], see also Theorem 5.2.5 in [6]). The
θ-free, strictly θ-free, and θ-k-codes designed in this section have entropy
larger than log 2 when the alphabet has p = 4 symbols. Hence, such DNA codes
can be used for encoding bit-strings.

We start with the definition of entropy given in [6].

Definition 31 Let X be a code. The entropy of X* is defined by

$$h(X^*) = \lim_{n \to \infty} \frac{1}{n} \log |\mathrm{Sub}_n(X^*)|.$$

If G is a deterministic automaton or an automaton with a delay that recognizes
X* and $A_G$ is the adjacency matrix of G, then by Perron-Frobenius theory
$A_G$ has a positive eigenvalue μ and the entropy of X* is log μ (see Chapter 4
of [6]). We will use this fact in the following computations of the entropies of
the designed codes. In [13], Proposition 16, the authors designed a set of DNA code
words that is strictly θ-free. The following propositions show that in a similar
way we can construct codes with additional “good” properties.

In what follows we assume that Σ is a finite alphabet with $|\Sigma| \geq 3$ and
$\theta : \Sigma \to \Sigma$ is an involution which is not the identity. We denote by p the number
of symbols in Σ. We also use the fact that X is (strictly) θ-free iff X* is (strictly)
θ-free (Proposition 4 in [14]).

Proposition 32 Let $a, b \in \Sigma$ be such that for all $c \in \Sigma \setminus \{a,b\}$, $\theta(c) \notin \{a,b\}$.
Let $X = \bigcup_{i=1}^{\infty} a^m(\Sigma \setminus \{a,b\})^i b^m$ for a fixed integer $m \geq 1$.

Then X and X* are θ-free. The entropy of X* is such that $\log(p-2) < h(X^*)$.

Proof. Let $x_1, x_2, y \in X$ be such that $x_1x_2 = s\theta(y)t$ for some $s, t \in \Sigma^+$, with
$x_1 = a^m p b^m$, $x_2 = a^m q b^m$ and $y = a^m r b^m$, for $p, q, r \in (\Sigma \setminus \{a,b\})^+$.
Since θ is an involution, if $\theta(a) \neq a, b$, then there is a $c \in \Sigma \setminus \{a,b\}$ such
that $\theta(c) = a$, which is excluded by assumption. Hence, either $\theta(a) = a$ or
$\theta(a) = b$. When θ is morphic $\theta(y) = \theta(a^m)\theta(r)\theta(b^m)$ and when θ is antimorphic
$\theta(y) = \theta(b^m)\theta(r)\theta(a^m)$. So, $\theta(y) = a^m\theta(r)b^m$ or $\theta(y) = b^m\theta(r)a^m$. Since
$x_1x_2 = a^m p b^m a^m q b^m = s b^m \theta(r) a^m t$ or $x_1x_2 = a^m p b^m a^m q b^m = s a^m \theta(r) b^m t$, the only
possibilities for r are $\theta(r) = p$ or $\theta(r) = q$. In the first case s = 1 and in the
second case t = 1, which is a contradiction with the definition of θ-free. Hence X
is θ-free.

Let $A = (V, E, \lambda)$ be the automaton that recognizes X* where
$V = \{1, \ldots, 2m+1\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with
$(i, s, j) \mapsto s$) is the labeling function.

An edge $(i, s, j)$ is in E if and only if:

$$s = \begin{cases} a, & \text{for } 1 \le i \le m,\ j = i+1 \\ b, & \text{for } m+2 \le i \le 2m,\ j = i+1, \text{ and } i = 2m+1,\ j = 1 \\ s, & \text{for } i = m+1, m+2,\ j = m+2,\ s \in \Sigma \setminus \{a,b\} \end{cases}$$

Then the adjacency matrix for A is a $(2m+1) \times (2m+1)$ matrix with ij-th entry
equal to the number of edges from vertex i to vertex j. The characteristic
polynomial can be computed to be $\det(A - \mu I) = (-\mu)^{2m}(p - 2 - \mu) + (-1)^{2m}(p-2)$.
The eigenvalues are solutions of the equation $\mu^{2m}(p-2) - \mu^{2m+1} + p - 2 = 0$,
which gives $p - 2 = \mu - \frac{p-2}{\mu^{2m}}$. The largest real value for μ is positive. Hence
$0 < \frac{p-2}{\mu^{2m}} < 1$, i.e., $p - 2 < \mu < p - 1$. □

In the case of the DNA alphabet, p = 4 and for m = 1 the above characteristic
equation becomes $\mu^3 - 2\mu^2 - 2 = 0$. The largest real value of μ is approximately
2.3593, which means that the entropy of X* is greater than log 2.
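
The value quoted above can be reproduced numerically. The Python sketch below is an illustration under stated assumptions: the function names are ours, the edge set is our reading of the automaton for X* (m edges labelled a, p - 2 edges from vertex m+1 to vertex m+2, p - 2 loops at vertex m+2, and m edges labelled b closing the cycle), and the Perron eigenvalue is approximated by plain power iteration instead of solving the characteristic equation.

```python
import math


def adjacency(p, m):
    """Adjacency matrix of the assumed automaton; state i is stored at index i - 1."""
    n = 2 * m + 1
    A = [[0] * n for _ in range(n)]
    for i in range(1, m + 1):          # a-edges: i -> i+1 for 1 <= i <= m
        A[i - 1][i] += 1
    A[m][m + 1] += p - 2               # letters other than a, b: m+1 -> m+2
    A[m + 1][m + 1] += p - 2           # ... and the loop m+2 -> m+2
    for i in range(m + 2, 2 * m + 1):  # b-edges: i -> i+1 for m+2 <= i <= 2m
        A[i - 1][i] += 1
    A[2 * m][0] += 1                   # closing b-edge: 2m+1 -> 1
    return A


def perron_root(A, iterations=500):
    """Approximate the largest eigenvalue of a primitive non-negative matrix."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)                   # v has maximum entry 1, so max(w) tends to the eigenvalue
        v = [x / lam for x in w]
    return lam


mu = perron_root(adjacency(4, 1))      # DNA alphabet (p = 4), m = 1
print(mu, math.log2(mu))
```

Power iteration converges here because the loop at vertex m+2 makes the graph aperiodic; for other automata a general eigenvalue routine would be the safer choice.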

Example 1. Consider the DNA alphabet Δ with θ = ρν. Let m = 2 and choose A
and T such that $X \subseteq \bigcup_{i=1}^{\infty} A^2\{G, C\}^i T^2$. Then X, and so X*, is θ-free.

Proposition 33 Choose distinct $a, b, c \in \Sigma$ such that $\theta(c) \neq a, b$ $\theta(a) \neq a$ and
$\theta(b) \neq b$. Let $X = \bigcup_{i=1}^{\infty} a^m(\Sigma^{m-1}c)^i b^m$ for some $m \geq 2$.

Then X, and so X*, is strictly θ-free. The entropy of X* is such that
$\log(p^{\frac{m-1}{m}}) < h(X^*) < \log((p^{m-1}+1)^{\frac{1}{m}})$.

Proof. (Sketch) Suppose there are $x, x_1, x_2 \in X$ such that $s\theta(x)t = x_1x_2$
for some $s, t \in \Sigma^+$. Let $x = a^m s_1 c s_2 c \cdots s_k c b^m$; then θ(x) is either
$\theta(a^m)\theta(s_1c \cdots s_kc)\theta(b^m)$ or $\theta(b^m)\theta(s_1c \cdots s_kc)\theta(a^m)$, which cannot be a proper
subword of $x_1x_2$ for any $x_1, x_2 \in X$. Hence X is θ-free.

Let $A = (V, E, \lambda)$ be the automaton that recognizes X* where
$V = \{1, \ldots, 3m\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ is the set of edges and $\lambda : E \to \Sigma$
(with $(i, s, j) \mapsto s$) is the labeling function.

An edge $(i, s, j)$ is in E if and only if:

$$s = \begin{cases} a, & \text{for } 1 \le i \le m,\ j = i+1 \\ b, & \text{for } 2m+1 \le i \le 3m-1,\ j = i+1, \text{ and } i = 3m,\ j = 1 \\ c, & \text{for } i = 2m,\ j = m+1, \text{ and } i = 2m,\ j = 2m+1 \\ t, & \text{for } m+1 \le i \le 2m-1,\ j = i+1,\ t \in \Sigma \end{cases}$$

Note that this automaton is not deterministic, but it has delay 1, hence
the entropy of X* can be obtained from its adjacency matrix. Let A be
the adjacency matrix of this automaton. The characteristic equation for A is
$-\mu^{3m} + p^{m-1}\mu^{2m} + p^{m-1} = 0$. This implies $p^{m-1} = \frac{\mu^{3m}}{\mu^{2m}+1} = \mu^m - \frac{\mu^m}{\mu^{2m}+1}$.

Since p is an integer and $0 < \frac{\mu^m}{\mu^{2m}+1} < 1$, we have $p^{\frac{m-1}{m}} < \mu$. □

For the DNA alphabet, p = 4 and for m = 2 the above characteristic equation
becomes $\mu^6 - 4\mu^4 - 4 = 0$. Solving for μ, the largest real value of μ is 2.055278539.
Hence the entropy of X* is greater than log 2.

Example 2. Consider Δ and θ = ρν and let m = 2, a = A, c = C, b = G.

Then $X = AA(\Delta C)^i GG$ and X* are strictly θ-free.

With the following propositions we consider ways to generate θ-subword
compliant and θ-k-codes.

Proposition 34 Choose $a, b \in \Sigma$ such that $\theta(a) = b$.

Let $X = \bigcup_{i=1}^{\infty} a^{k-1}((\Sigma \setminus \{a,b\})^k b)^i$ for $k > 1$.

Then X is θ(k, k, k)-subword compliant. The entropy of X* is such that
$h(X^*) > \log((p-2)^{\frac{k}{k+1}})$.

Proof. We use induction on i. For i = 1, $X = a^{k-1}B^kb$, where $B = \Sigma \setminus \{a,b\}$.
Then for every $x \in X$, $|x| = 2k$. Hence X is θ(k, k, k)-subword compliant.
Assume that the statement is true for i = n. Let
$x \in X \subseteq a^{k-1}((\Sigma \setminus \{a,b\})^k b)^{n+1} = a^{k-1}(B^kb)^{n+1}$. Suppose $x \in \Sigma^* u \Sigma^k \theta(u) \Sigma^*$ for some $u \in \Sigma^k$.

Then there are several ways for θ(u) to be a subword of x. If θ(u) is a subword
of $B^kbB^k$ then u is a subword of $B^kaB^k$, which is not possible. If θ(u) is a
subword of $bB^k$ then u is a subword of either $B^ka$ or $aB^k$. In the first case u
would not be a subword of X and in the second case $x \in \Sigma^* u \Sigma^t \theta(u) \Sigma^*$ for some
$t \neq k$, which is a contradiction. If $\theta(u) \in B^k$ then $u \in B^k$, which is a contradiction
to our assumption that k > 1. Hence X is θ(k, k, k)-subword compliant.

Let $A = (V, E, \lambda)$ be the automaton that recognizes X* where $V = \{1, \ldots, 2k\}$
is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with $(i, s, j) \mapsto s$) is the
labeling function.

An edge $(i, s, j)$ is in E if and only if:

$$s = \begin{cases} a, & \text{for } 1 \le i \le k-1,\ j = i+1 \\ b, & \text{for } i = 2k,\ j = k, \text{ and } i = 2k,\ j = 1 \\ s, & \text{for } k \le i \le 2k-1,\ j = i+1,\ s \in \Sigma \setminus \{a,b\} \end{cases}$$

This automaton has delay 1. Let A be the adjacency matrix of this automaton.
The characteristic equation is $\mu^{2k} - (p-2)^k\mu^{k-1} - (p-2)^k = 0$. Then $(p-2)^k = \frac{\mu^{2k}}{\mu^{k-1}+1}$.
Since p - 2 is an integer, μ > 1. Hence $\mu > (p-2)^{\frac{k}{k+1}}$. □

For the DNA alphabet, p = 4 and for k = 2 the above characteristic equation
becomes $\mu^4 - 4\mu - 4 = 0$. When we solve for μ we get the largest real value of μ
to be 1.83508, which is greater than the golden mean (1.618) but does not quite
reach 2, so the entropy log μ falls short of log 2. In this case encoding bits with
symbols from Σ is not possible with X*, although as k approaches infinity, μ approaches 2.
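
Both numerical claims in this paragraph (the root for k = 2 and the limit as k grows) are easy to check. The sketch below is a minimal illustration, assuming the characteristic equation has the form μ^{2k} - (p-2)^k μ^{k-1} - (p-2)^k = 0, which reduces to μ^4 - 4μ - 4 = 0 for p = 4 and k = 2. The function name and the bisection bracket [1, p] are our own choices.

```python
def largest_root(p, k, tol=1e-12):
    """Positive root of mu^(2k) - c*mu^(k-1) - c with c = (p-2)^k, found by bisection."""
    c = (p - 2) ** k

    def f(mu):
        return mu ** (2 * k) - c * mu ** (k - 1) - c

    lo, hi = 1.0, float(p)        # f(1) = 1 - 2c < 0 and f(p) > 0 when p >= 3
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for k in (2, 5, 20, 50):
    mu = largest_root(4, k)
    print(k, round(mu, 5), mu > 2 ** (k / (k + 1)))
```

The polynomial has a single sign change in its coefficients, so it has exactly one positive root and the bracket cannot land on a different one.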

Example 3. Consider Δ with θ = ρν. Let k = 2 and choose A and T so that
$X = \bigcup_{i=1}^{\infty} A(\{G, C\}^2T)^i$. Clearly X is θ(2, 2, 2)-subword compliant.

Proposition 35 Let $a, b \in \Sigma$ be such that $\theta(a) = b$ and let
$X = a^{k-1}((\Sigma \setminus \{a,b\})^{k-2}b)^i$.

Then X is θ(k, 1, m)-subword compliant for $k \geq 3$ and $m \geq 1$. Moreover,
when θ is morphic then X is a θ-k-code. The entropy of X* is such that
$\log((p-2)^{\frac{k-2}{k-1}}) < h(X^*)$.

Proof. Suppose there exists $x \in X$ such that $x = rus\theta(u)t$ for some $r, t \in \Sigma^*$,
$u, \theta(u) \in \Sigma^k$ and $s \in \Sigma^m$. Let $x = a^{k-1}s_1bs_2b \cdots s_nb$ where $s_i \in (\Sigma \setminus \{a,b\})^{k-2}$.
Then the following are all possible cases for u:

1. u is a subword of $a^{k-1}s_1$,

2. u is a subword of $as_1b$,

3. u is a subword of $s_1bs_2$,

4. u is a subword of $bs_ib$ for some $i \leq n$.

In all these cases, since $\theta(a) = b$, θ(u) is not a subword of x. Hence X is θ(k, 1, m)-subword
compliant.

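Let $A = (V, E, \lambda)$ be the automaton that recognizes X* where
$V = \{1, \ldots, 2k-2\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with
$(i, s, j) \mapsto s$) is the labeling function.

An edge $(i, s, j)$ is in E if and only if:

$$s = \begin{cases} a, & \text{for } 1 \le i \le k-1,\ j = i+1 \\ b, & \text{for } i = 2k-2,\ j = k, \text{ and } i = 2k-2,\ j = 1 \\ s, & \text{for } k \le i \le 2k-3,\ j = i+1,\ s \in \Sigma \setminus \{a,b\} \end{cases}$$

This automaton has delay 1.

Let A be the adjacency matrix of this automaton. The characteristic equation
is $(-\mu)^{2k-2} - (p-2)^{k-2}\mu^{k-1} - (p-2)^{k-2} = 0$. So $(p-2)^{k-2} = \mu^{k-1} - \frac{\mu^{k-1}}{\mu^{k-1}+1}$.

Since $0 < \frac{\mu^{k-1}}{\mu^{k-1}+1} < 1$, $(p-2)^{\frac{k-2}{k-1}} < \mu$. □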

For the DNA alphabet, p = 4 and for k = 3 the above characteristic equation
becomes $\mu^4 - 2\mu^2 - 2 = 0$. Solving for μ, the largest real value is 1.6528, which is
greater than the golden mean (1.618) but less than 2. Again, as in the previous
case, the asymptotic value for μ is 2 when k approaches infinity.

Example 4. Consider Δ with θ = ρν and choose k = 3. Then
$X = AA(\{G, C\}T)^i$ is θ(3, 1, m)-subword compliant for any $m \geq 1$.
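
A brute-force check of this example for short code words can be sketched as follows. The sketch rests on assumptions that we state explicitly: X is read as the words AA(sT)^i with s ranging over {G, C}, θ is taken to be the Watson-Crick antimorphic involution (reverse complement), and θ(k, m_1, m_2)-subword compliance is read as forbidding a subword u v θ(u) with |u| = k and m_1 ≤ |v| ≤ m_2. All function names are ours.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def theta(w):
    """Watson-Crick involution: complement each base and reverse the strand."""
    return "".join(COMPLEMENT[base] for base in reversed(w))


def code_words(max_i):
    """Words AA(sT)^i with s in {G, C}, for 1 <= i <= max_i."""
    for i in range(1, max_i + 1):
        for choice in product("GC", repeat=i):
            yield "AA" + "".join(s + "T" for s in choice)


def has_hairpin(x, k, m1, m2):
    """True if x contains u v theta(u) with |u| = k and m1 <= |v| <= m2."""
    for i in range(len(x) - k + 1):
        target = theta(x[i:i + k])
        for m in range(m1, m2 + 1):
            j = i + k + m
            if j + k <= len(x) and x[j:j + k] == target:
                return True
    return False


words = list(code_words(5))
# Taking m2 = len(x) covers every gap length that fits inside the word.
violations = [x for x in words if has_hairpin(x, 3, 1, len(x))]
print(len(words), "words checked,", len(violations), "violations")
```

A check like this does not replace the proof, since it only covers finitely many words, but it would catch a misread exponent in the definition of X quickly.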

As other authors have observed, it is easy to get a θ-k-code if one
of the symbols in the alphabet is completely ignored in the construction of the
code X.

Proposition 36 Assume that $\theta(a) \neq a$ for all symbols $a \in \Sigma$. Let $b, c \in \Sigma$ be such
that $\theta(b) = c$ and let $X = \bigcup_{i=1}^{\infty} a^{k-1}((\Sigma \setminus \{c\})^{k-2}b)^i$ for $k \geq 3$.

Then X and X* are θ-k-codes. The entropy of X* is such that
$\log((p-1)^{\frac{k-2}{k-1}}) < h(X^*)$.

Proof. The fact that X* is a θ-k-code is straightforward, since every subword
of $x \in X$ of length k is either a power of a or contains the symbol b.

Let $A = (V, E, \lambda)$ be the automaton that recognizes X* where
$V = \{1, \ldots, 2k-2\}$ is the set of vertices, $E \subseteq V \times \Sigma \times V$ and $\lambda : E \to \Sigma$ (with
$(i, s, j) \mapsto s$) is the labeling function and, as in the previous propositions, the
edges $(i, s, j)$ are defined such that:

$$s = \begin{cases} a, & \text{for } 1 \le i \le k-1,\ j = i+1 \\ b, & \text{for } i = 2k-2,\ j = k, \text{ and } i = 2k-2,\ j = 1 \\ s, & \text{for } k \le i \le 2k-3,\ j = i+1,\ s \in \Sigma \setminus \{c\} \end{cases}$$

This automaton has delay 1.

Let A be the adjacency matrix of this automaton with the characteristic
equation $\mu^{2k-2} - (p-1)^{k-2}\mu^{k-1} - (p-1)^{k-2} = 0$. This implies
$(p-1)^{k-2} = \mu^{k-1} - \frac{\mu^{k-1}}{\mu^{k-1}+1}$. We are interested in the largest real value for μ. Since μ > 0, we
have $0 < \frac{\mu^{k-1}}{\mu^{k-1}+1} < 1$, which implies $(p-1)^{\frac{k-2}{k-1}} < \mu$. □

For the DNA alphabet, p = 4 and for k = 4 the above estimate says that
$\mu > 3^{\frac{2}{3}} > 2$. Hence the entropy of X* in this case is greater than log 2.

4 Concluding Remarks 

In this paper we investigated theoretical properties of languages that consist of
DNA-based code words. In particular we concentrated on intermolecular and
intramolecular cross hybridizations that can occur when the Watson-Crick
complement of a (sub)word of a code word is also a (sub)word of a code
word. These conditions are necessary for the design of good codes, but certainly
may not be sufficient. For example, the algorithms used in the programs
developed by Seeman [23], Feldkamp [10] and Ruben [22] all check for uniqueness of
k-length subsequences in the code words. Unfortunately, none of the properties
from Definition 21 ensures uniqueness of k-length words. Such code word
properties remain to be investigated. We hope that the general methods of designing
such code words will simplify the search for “good” codes. Better
characterizations of good code words that are closed under the Kleene * operation may provide
even faster ways of designing such code words. Although Proposition 36
provides a rather good design of code words, the potential repetition of certain
subwords is not desirable. The most challenging question, that of characterizing and
designing good θ-k-codes that avoid numerous repetitions of subwords, remains
open.

Our approach to the question of designing “good” DNA codes has been from
the formal language theory aspect. Many issues that are involved in designing
such codes have not been considered. These include (but are not limited to)
free energy conditions, melting temperature, and Hamming distance
conditions. All of these remain challenging problems, and a procedure that
includes all or the majority of these aspects would be desirable in practice. It may be
the case that regardless of the way the codes are designed, the ultimate test for
the “goodness” of the codes will be in the laboratory.

Abstract. P systems with symport/antiport rules of a minimal size
(only one object passes in any direction in a communication step) were
recently proven to be computationally universal. The proof from [2] uses
systems with nine membranes. In this paper we improve this result, by
showing that six membranes suffice. The optimality of this result remains
open (we believe that the number of membranes can be reduced by one).



1 Introduction 

The present paper deals with a class of P systems which has recently received
considerable interest: the purely communicative ones, based on the biological
phenomena of symport/antiport.

P systems are distributed parallel computing models which abstract from 
the structure and the functioning of the living cells. In short, we have a mem- 
brane structure, consisting of several membranes embedded in a main membrane 
(called the skin) and delimiting regions (Figure 1 illustrates these notions) where 
multisets of certain objects are placed. In the basic variant, the objects evolve 
according to given evolution rules, which are applied non-deterministically (the 
rules to be used and the objects to evolve are randomly chosen) in a maximally 
parallel manner (in each step, all objects which can evolve must evolve). The 
objects can also be communicated from one region to another one. In this way, 
we get transitions from a configuration of the system to the next one. A sequence 
of transitions constitutes a computation; with each halting computation we asso- 
ciate a result, the number of objects in an initially specified output membrane. 

* Research supported by Natural Sciences and Engineering Research Council of
Canada grants and the Canada Research Chair Program to L.K. and A.P.

Since these computing devices were introduced ([8]), several different classes
have been considered. Many of them were proved to be computationally complete, able
to compute all Turing computable sets of natural numbers. When membrane
division, membrane creation (or string-object replication) is allowed, NP-complete
problems are shown to be solved in polynomial time. Comprehensive details can
be found in the monograph [9], while information about the state of the art of the
domain can be found at the web address http://psystems.disco.unimib.it.



Figure 1: A membrane structure (diagram labels: membrane, skin, elementary membrane, region)



A purely communicative variant of P systems was proposed in [7], modeling
a real-life phenomenon, that of membrane transport in pairs of chemicals (see
[1]). When two chemicals pass through a membrane only together, in the same
direction, we say that we have a process of symport. When the two chemicals pass
only with the help of each other, but in opposite directions, the process is called
antiport. For uniformity, when a single chemical passes through a membrane,
one says that we have a uniport.

Technically, the rules modeling these biological phenomena and used in P
systems are of the forms (x, in), (x, out) (for symport), and (x, out; y, in) (for
antiport), where x, y are strings of symbols representing multisets of chemicals.
Thus, the only rules used govern the passage of objects through membranes; the
objects only change their places in the compartments of the membrane structure,
they never transform or evolve.

Somewhat surprisingly, computing by communication only, in this “osmotic”
manner, turned out to be computationally universal: by using only symport and
antiport rules we can compute all Turing computable sets of numbers [7]. The
results from [7] were improved in several places (see, e.g., [3], [4], [6], [9]) in what
concerns the number of membranes used and/or the size of symport/antiport
rules.









Lila Kari, Carlos Martín-Vide, and Andrei Păun



Recently, a rather unexpected result was reported in [2]: in order to get 
the universality, minimal symport and antiport rules, that is of the forms 
(a, in), (a, out), (a, out; b, in), where a, b are objects, are sufficient. The price was 
to use nine membranes, much more than in the results from [3] and [4], for ex- 
ample. The problem whether or not the number of membranes can be decreased 
was formulated as an open problem in [2]. We contribute here to this question, 
by improving the result from [2]: six membranes suffice. The proof uses the same 
techniques as the proofs from [2], [4]: simulating a counter automaton by a P 
system with minimal symport/antiport rules.

It is highly probable that our result is not optimal, but we conjecture that 
it cannot be significantly improved; we believe that at most one membrane can 
be saved. 

2 Counter Automata 

In this section we briefly recall the concept of counter automata, useful in the
proof of our main theorem. We follow here the style of [2] and [4]. Informally
speaking, a counter automaton is a finite state machine that has a finite number
of counters able to store values represented by natural numbers; the machine
runs a program consisting of instructions which can increase or decrease by one
the contents of counters, changing at the same time the state of the automaton;
starting with each counter empty, the machine performs a computation; if it
reaches a terminal state, then the number stored in a specified counter is said
to be generated during this computation. It is known (see, e.g., [5]) that counter
automata (of various types) are computationally universal: they can generate
exactly all Turing computable sets of natural numbers.

More formally, a counter automaton is a construct $M = (Q, F, p_0, C, c_{out}, S)$,
where:

– Q is the set of the possible states,

– $F \subseteq Q$ is the set of the final states,

– $p_0 \in Q$ is the start state,

– C is the set of the counters,

– $c_{out} \in C$ is the output counter,

– S is a finite set of instructions of the following forms:

$(p \to q, +c)$, with $p, q \in Q$, $c \in C$: add 1 to the value of the counter c
and move from state p into state q;

$(p \to q, -c)$, with $p, q \in Q$, $c \in C$: if the current value of the counter c is
not zero, then subtract 1 from the value of the counter c and move from
state p into state q; otherwise the computation is blocked in state p;

$(p \to q, c = 0)$, with $p, q \in Q$, $c \in C$: if the current value of the counter c
is zero, then move from state p into state q; otherwise the computation
is blocked in state p.

A transition step in such a counter automaton consists in updating/checking
the value of a counter according to an instruction of one of the types presented
above and moving from a state to another one. Starting with the number zero
stored in each counter, we say that the counter automaton computes the value n
if and only if, starting from the initial state, the system reaches a final state after
a finite sequence of transitions, with n being the value of the output counter $c_{out}$
at that moment.
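
To make the definition concrete, the following Python sketch explores all computations of a small counter automaton up to a bounded number of transitions and collects the values it generates. It is an illustration only: the tuple encoding of instructions, the function names, and the example automaton are our own assumptions and do not come from [2] or [4].

```python
def successors(config, instructions, counters):
    """Yield every configuration reachable from `config` in one transition step."""
    state, values = config
    for p, q, op, c in instructions:
        if p != state:
            continue
        i = counters.index(c)
        if op == "+":
            new_value = values[i] + 1
        elif op == "-" and values[i] > 0:
            new_value = values[i] - 1
        elif op == "=0" and values[i] == 0:
            new_value = 0
        else:
            continue  # the instruction is blocked in this configuration
        yield q, values[:i] + (new_value,) + values[i + 1:]


def generated(instructions, start, finals, counters, out, max_steps):
    """Numbers produced by computations of at most `max_steps` transitions."""
    start_config = (start, (0,) * len(counters))
    frontier, seen, results = {start_config}, {start_config}, set()
    for _ in range(max_steps + 1):
        next_frontier = set()
        for config in frontier:
            state, values = config
            if state in finals:
                results.add(values[counters.index(out)])
                continue  # by assumption no transition leaves a final state
            for succ in successors(config, instructions, counters):
                if succ not in seen:
                    seen.add(succ)
                    next_frontier.add(succ)
        frontier = next_frontier
    return results


# Hypothetical example: counter "c" grows in pairs; counter "z" stays empty
# and only serves to enter the final state "f" through a zero test.
INSTRUCTIONS = [
    ("p0", "p1", "+", "c"),
    ("p1", "p0", "+", "c"),
    ("p0", "f", "=0", "z"),
]
print(sorted(generated(INSTRUCTIONS, "p0", {"f"}, ("c", "z"), "c", max_steps=9)))
```

The step bound is needed because the automaton is non-deterministic and may have arbitrarily long computations; the set it generates is the union of these results over all bounds.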

Without loss of generality, we may assume that in the end of the computation 
the automaton makes zero all the counters but the output counter; also, we may 
assume that there are no transitions possible that start from a final state (this 
is to avoid the automaton getting stuck in a final state). 

As we have mentioned above, such counter automata are computationally 
equivalent to Turing machines, and we will make below an essential use of this 
result. 

3 P Systems with Symport/Antiport Rules

The language theory notions we use here are standard, and can be found in any 
of the many monographs available, for instance, in [11]. 

A membrane structure is pictorially represented by a Venn diagram (like the
one in Figure 1), and it will be represented here by a string of matching
parentheses. For instance, the membrane structure from Figure 1 can be represented
by $[_1[_2\,]_2[_3\,]_3[_4[_5\,]_5[_6[_8\,]_8[_9\,]_9]_6[_7\,]_7]_4]_1$.
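
Such bracket strings are easy to turn into a tree. The Python sketch below is a minimal illustration: the plain-text encoding "[1[2]2[3[4]4]3]1" (each bracket followed by its label), the function name, and the small example structure are assumptions of ours, not notation from the paper.

```python
import re


def parse_membranes(s):
    """Return {label: parent label} for a string such as '[1[2]2[3[4]4]3]1'."""
    parent, stack = {}, []
    for open_label, close_label in re.findall(r"\[\s*(\d+)|\]\s*(\d+)", s):
        if open_label:
            label = int(open_label)
            parent[label] = stack[-1] if stack else None  # the skin has no parent
            stack.append(label)
        elif not stack or stack.pop() != int(close_label):
            raise ValueError("mismatched membrane labels")
    if stack:
        raise ValueError("unclosed membrane")
    return parent


# Hypothetical four-membrane structure: skin 1 contains 2 and 3, and 3 contains 4.
structure = parse_membranes("[1[2]2[3[4]4]3]1")
elementary = sorted(m for m in structure if m not in structure.values())
print(structure, elementary)
```

Membranes that never appear as a parent are the elementary ones, which is how the last line identifies them.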

A multiset over a set X is a mapping $M : X \to \mathbb{N}$. Here we always use
multisets over finite sets X (that is, X will be an alphabet). A multiset with a
finite support can be represented by a string over X; the number of occurrences
of a symbol $a \in X$ in a string $x \in X^*$ represents the multiplicity of a in the
multiset represented by x. Clearly, all permutations of a string represent the
same multiset, and the empty multiset is represented by the empty string, λ.

We start from the biological observation that there are many cases where 
two chemicals pass at the same time through a membrane, with the help of each 
other, either in the same direction, or in opposite directions; in the first case we 
say that we have a symport, in the second case we have an antiport (we refer to 
[1] for details). 

Mathematically, we can capture the idea of symport by considering rules of
the form (ab, in) and (ab, out) associated with a membrane, and stating that the
objects a, b can enter, respectively, exit the membrane together. For antiport we
consider rules of the form (a, out; b, in), stating that a exits and at the same time
b enters the membrane. Generalizing such kinds of rules, we can consider rules of
the unrestricted forms (x, in), (x, out) (generalized symport) and (x, out; y, in)
(generalized antiport), where x, y are non-empty strings representing multisets
of objects, without any restriction on the length of these strings.

Based on rules of these types, in [7] one introduces P systems with
symport/antiport as constructs

$$\Pi = (V, \mu, w_1, \ldots, w_m, E, R_1, \ldots, R_m, i_0),$$

where:

– V is an alphabet (its elements are called objects);

– μ is a membrane structure consisting of m membranes, with the membranes
(and hence the regions) injectively labeled with 1, 2, ..., m; m is called the
degree of Π;

– $w_i$, $1 \le i \le m$, are strings over V representing multisets of objects associated
with the regions 1, 2, ..., m of μ, present in the system at the beginning of
a computation;

– $E \subseteq V$ is the set of objects which are supposed to continuously appear in
the environment in arbitrarily many copies;

– $R_1, \ldots, R_m$ are finite sets of symport and antiport rules over the alphabet
V associated with the membranes 1, 2, ..., m of μ;

– $i_0$ is the label of an elementary membrane of μ (the output membrane).

For a symport rule (x, in) or (x, out), we say that |x| is the weight of the
rule. The weight of an antiport rule (x, out; y, in) is max{|x|, |y|}.

The rules from a set Ri are used with respect to membrane i as explained 
above. In the case of (x,in), the multiset of objects x enters the region defined 
by the membrane, from the surrounding region, which is the environment when 
the rule is associated with the skin membrane. In the case of (x,out), the ob- 
jects specified by x are sent out of membrane i, into the surrounding region; 
in the case of the skin membrane, this is the environment. The use of a rule 
(x, out; y, in) means expelling the objects specified by x from membrane i at the 
same time with bringing the objects specified by y into membrane i. The objects 
from E (in the environment) are supposed to appear in arbitrarily many copies; 
since we only move objects from a membrane to another membrane and do not 
create new objects in the system, we need a supply of objects in order to com- 
pute with arbitrarily large multisets. The rules are used in the non-deterministic 
maximally parallel manner specific to P systems with symbol objects: in each 
step, a maximal number of rules is used (all objects which can change the region 
should do it). 
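
The effect of a single rule on the two regions it connects can be sketched in a few lines of Python. This is only an illustration under our own assumptions: multisets are modelled with collections.Counter, rules are encoded as tuples ("in", x), ("out", x) or ("antiport", x, y), and neither the maximally parallel application of rules nor the unbounded supply of objects from E in the environment is modelled.

```python
from collections import Counter


def weight(rule):
    """Weight of a rule: |x| for symport, max(|x|, |y|) for antiport."""
    if rule[0] == "antiport":
        _, x, y = rule
        return max(len(x), len(y))
    _, x = rule
    return len(x)


def apply_rule(rule, inside, outside):
    """Apply one rule once; return the new (inside, outside) pair, or None if not applicable."""
    kind = rule[0]
    if kind == "in":
        leaving, entering = Counter(), Counter(rule[1])
    elif kind == "out":
        leaving, entering = Counter(rule[1]), Counter()
    else:  # antiport (x, out; y, in)
        leaving, entering = Counter(rule[1]), Counter(rule[2])
    if any(inside[obj] < n for obj, n in leaving.items()):
        return None
    if any(outside[obj] < n for obj, n in entering.items()):
        return None
    return inside - leaving + entering, outside - entering + leaving


# Hypothetical regions: the membrane holds a, b and the surrounding region holds c.
result = apply_rule(("antiport", "a", "c"), Counter("ab"), Counter("c"))
print(result, weight(("antiport", "ab", "c")))
```

A full simulator would repeatedly choose a maximal multiset of applicable rules across all membranes in each step; the function above covers only the bookkeeping for one rule instance.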

In this way, we obtain transitions between the configurations of the system.
A configuration is described by the m-tuple of multisets of objects present in
the m regions of the system, as well as the multiset of objects from V - E which
were sent out of the system during the computation; it is important to keep
track of such objects because they appear only in a finite number of copies in
the initial configuration and can enter the system again. On the other hand, it is
not necessary to take care of the objects from E which leave the system because
they appear in arbitrarily many copies in the environment as defined before (the
environment is supposed to be inexhaustible: irrespective of how many copies of
an object from E are introduced into the system, still arbitrarily many remain
in the environment). The initial configuration is $(w_1, \ldots, w_m, \lambda)$. A sequence of
transitions is called a computation.

With any halting computation, we may associate an output represented by
the number of objects from V present in membrane $i_0$ in the halting
configuration. The set of all such numbers computed by Π is denoted by N(Π). The
family of all sets N(Π) computed by systems Π of degree at most $m \geq 1$, using
symport rules of weight at most p and antiport rules of weight at most q, is
denoted by $NOP_m(sym_p, anti_q)$ (we use here the notations from [9]).

Details about P systems with symport/antiport rules can be found in [9]; 
a complete formalization of the syntax and the semantics of these systems is 
provided in [10]. 

We recall from [3], [4] the best known results dealing with the power of P 
systems with symport/antiport. 

Theorem 1. $NRE = NOP_m(sym_r, anti_t)$, for $(m, r, t) \in \{(1,1,2), (3,2,0),
(2,3,0)\}$.

The optimality of these results is not known. In particular, it is an open
problem whether or not also the families $NOP_m(sym_r, anti_t)$ with $(m, r, t) \in
\{(2,2,0), (2,2,1)\}$ are equal to NRE.

Note that we do not have here a universality result for systems of type 
(m, 1, 1). Recently, such a surprising result was proved in [2]: 

Theorem 2. $NRE = NOP_9(sym_1, anti_1)$.

Thus, at the price of using nine membranes, uniport rules together with 
antiport rules as common in biology (one chemical exits in exchange with other 
chemical) suffice for obtaining the Turing computational level. The question 
whether or not the number of membranes can be decreased was formulated as 
an open problem in [2]. 

4 Universality with Six Membranes 

We (partially) solve the problem from [2], by improving the result from Theorem
2: the number of membranes can be decreased to six, but we do not know
whether this is an optimal bound or not.

Theorem 3. $NRE = NOP_6(sym_1, anti_1)$.

Proof. Let us consider a counter automaton $M = (Q, F, p_0, C, c_{out}, S)$ as
specified in Section 2. We construct the symport/antiport P system

$$\Pi = (V, \mu, w_1, w_2, w_3, w_4, w_5, w_6, E, R_1, R_2, R_3, R_4, R_5, R_6, i_0),$$



where: 

$V = Q \cup \{c_q \mid c \in C,\ q \in Q, \text{ and } (p \to q, +c) \in S\}$

$\quad \cup\ \{c'_q, d_{c,q} \mid c \in C,\ q \in Q, \text{ and } (p \to q, -c) \in S\}$

$\quad \cup\ \{c''_q, d'_{c,q} \mid c \in C,\ q \in Q, \text{ and } (p \to q, c = 0) \in S\}$

$\quad \cup\ \{a_1, a_2, a_3, a_4, b_1, b_2, i_1, i_2, i_3, i_4, i_5, h, h', h'', n_1, n_2, n_3, n_4, \#_1, \#_3\}$,

$\mu = [_1[_2[_3[_4\,]_4]_3[_5[_6\,]_6]_5]_2]_1$,

$w_1 = b_1b_2\#_3$,

$w_2 = a_1i_1i_2i_4i_5n_1n_2n_3n_4h''\#_1$,

$w_3 = a_2a_3i_3$,

$w_4 = a_4\#_3$,

$w_5 = h'$,

$w_6 = \lambda$,

$E = V - \{\#_1, b_1\}$,

$i_0 = 6$,

$R_i = R'_i \cup R''_i \cup R'''_i$, where $1 \le i \le 6$.

Each computation in M will be simulated by Π in three main phases; the first
phase will use rules from $R'_i$, $1 \le i \le 6$, $R''_i$ contains the rules for the second
phase, and $R'''_i$ are the rules used for the third phase. These phases perform the
following operations: (1) preparing the system for the simulation, (2) the actual
simulation of the counter automaton, and (3) terminating the computation and
moving the relevant objects into the output membrane.

We give now the rules from the sets $R'_i$, $R''_i$, $R'''_i$ for each membrane together
with explanations about their use in the simulation of the counter automaton.

Phase 1 performs the following operations: we bring in membrane 2 an
arbitrary number of objects $q \in Q$ that represent the states of the automaton, then
we also bring in membrane 4 an arbitrary number of objects $d_{c,q}$ and $d'_{c,q}$ that
will be used in the simulation phase for simulating the rules $(p \to q, -c)$ and
$(p \to q, c = 0)$, respectively. The rules used in this phase are as follows:

$R'_1 = \{(b_1, out; X, in) \mid X \in Q \cup \{d_{c,p}, d'_{c,p} \mid c \in C, p \in Q\}\} \cup \{(b_1, in)\}$,

$R'_2 = \{(b_2, out; X, in) \mid X \in Q \cup \{d_{c,p}, d'_{c,p} \mid c \in C, p \in Q\}\} \cup \{(b_2, in)\}$

$\quad \cup\ \{(a_1, out; b_1, in), (b_2, out; \#_3, in), (a_2, out; a_1, in), (a_2, out; \#_3, in)\}$,

$R'_3 = \{(a_3, out; d, in) \mid d \in \{d_{c,p}, d'_{c,p} \mid c \in C, p \in Q\}\} \cup \{(a_3, in)\}$

$\quad \cup\ \{(a_2, out; b_2, in), (a_4, out; b_1, in), (\#_3, in), (\#_3, out)\}$,

$R'_4 = \{(a_4, out; d, in) \mid d \in \{d_{c,p}, d'_{c,p} \mid c \in C, p \in Q\}\} \cup \{(a_4, in)\}$

$\quad \cup\ \{(a_4, out; b_2, in), (b_2, out; a_3, in)\}$,

$R'_5 = \{(h', out; h'', in), (h'', out; h', in)\}$,

$R'_6 = \emptyset$.

The special symbol $b_1$ brings from the environment the objects $q$, $d_{c,q}$, $d'_{c,q}$
by means of the rules $(b_1, out; X, in)$, and at the same time the symbol $b_2$ enters
membrane 2 using the rule $(b_2, in)$. At the next step $b_1$ comes back in the system,
while $b_2$ moves the object that was introduced in membrane 1 in the previous
step, $q$, $d_{c,q}$, or $d'_{c,q}$, into membrane 2 by means of the rules $(b_2, out; X, in)$. We
can iterate these steps since we reach a configuration similar to the original
configuration.

If the objects moved from the environment into membrane 2 are versions
of d, then those objects are immediately moved into membrane 4 by the rules
$(a_3, out; d, in) \in R'_3$ and $(a_4, out; d, in) \in R'_4$. One can notice that in the
simulation phase these special symbols that we bring in the system in this initial
phase are used to simulate some specific rules from the counter automaton. A
difficulty appears here because there are rules that will allow such a symbol to
exit membrane 1 bringing in another such symbol (this leads to a partial
simulation of rules from the counter automaton). To solve this problem we make sure
that the symbols that we bring in the system do end up in one of membranes
2 or 4: if an object $q$, $d_{c,q}$, or $d'_{c,q}$ exits membrane 1 (using a rule from $R''_1$)
immediately after it is brought in, then $b_2$, which is in membrane 2 now, has to
use the rule $(b_2, out; \#_3, in) \in R'_2$ and then the computation will never halt since
$\#_3$ is present in membrane 2 and will move forever between membranes 2 and 3
by means of $(\#_3, in)$, $(\#_3, out)$ from $R'_3$.

After bringing into membrane 2 an arbitrary number of symbols $q$ and into
membrane 4 an arbitrary number of symbols $d_{c,q}$, $d'_{c,q}$ we pass to the
second phase of the computation in $\Pi$, the actual simulation of rules from
the counter automaton. Before that we have to stop the "influx" of special
symbols from the environment: instead of going into the environment, $b_1$ is
interchanged with $a_1$ from membrane 2 by means of the rule
$(a_1, out; b_1, in)$; at the same step $b_2$ enters the same membrane 2 by
$(b_2, in)$. Next $b_2$ is interchanged with $a_2$ by using the rule
$(a_2, out; b_2, in) \in R'_3$, then in membrane 4 the same $b_2$ is
interchanged with $a_4$ by means of $(a_4, out; b_2, in) \in R'_4$;
simultaneously, for membrane 2 we apply $(a_2, out; a_1, in)$. At the next step
we use the rules $(a_4, out; b_1, in)$ and $(b_2, out; a_3, in)$ from $R'_3$
and $R'_4$, respectively.

There are two delicate points in this process. First, if $b_2$, instead of
bringing into membrane 2 the objects from the environment, starts the finishing
process by $(a_2, out; b_2, in) \in R'_3$, then at the next step the only rule
possible is $(a_2, out; \#_3, in) \in R'_2$, since $a_1$ is still in membrane
2, and then the computation will never render a result. The second problem can
be noticed by looking at the rules $(a_4, in)$ and $(a_4, out; b_1, in)$
associated with membranes 4 and 3, respectively: if instead of applying the
second rule in the finishing phase of this step, we apply the rule of membrane
4, then the computation stops in membranes 1 through 4, but for membrane 5 we
will apply the rules from $R'_5$ continuously.

The second phase starts by introducing the start state of M in membrane 
1, then we simulate all the rules from the counter automaton; to do this we use 
the following rules: 

R![ = {{as,out;po,in), (#i,OMt), (#i,zn)} U {{p,out; Cq,in), {p, out; c'^, in), 
{dc,q, out; q, in), {p, out; c", in), (d(, out; q, in) \ p,q G Q,c G C}, 

R'2 = {(<7, out; Cq, in), (#1, out; Cq, in), {n\,out; c^, in), {dc^q, out; ns, in), 

{ii,out; c'g,in), {dc,q,out; #3, in), {d'^ g,out; is, in), {d'^ g,out; #3, in) \ 

qGQ,cGC} 

U {{as, out), {u2, out; ni,in), {ns, out; ri2, in), {ns, out; ns, in), 

(#1, out; ns, in), (z2, out; i\,in), (is, out; Z2, in), {is, out; is, in), 

{is, out; is,in)}. 

R'i = {{Cq^in), {dc,q, out; Ca,in), {is, out; c”,in), {d'^g,out; Ca,in), 

{d'^ q, out; is, in) \ q,a G Q,c e C}, 

R'i = {{dc,q, out; c'g, in), (#3, out; c'g, in), {d'^ g, out; c", in), (#3, out; c", in) \ 
q G Q , c G C}) 

m = 0 , 

i?" = 0. 

We now explain the usage of these rules: we bring into the system the start
state $p_0$ by using the rules $(a_4, out) \in R''_2$ and then
$(a_4, out; p_0, in) \in R''_1$; $a_4$ should be in membrane 2 at the end of
the step 1 if everything went well in the first phase.

We are now ready to simulate the transitions of the counter automaton. 

The simulation of an instruction $(p \to q, +c)$ is done as follows. First $p$
is exchanged with $c_q$ by $(p, out; c_q, in) \in R''_1$, and then at the next
step $q$ is pushed into membrane 1 while $c_q$ enters membrane 2 by means of
the rule $(q, out; c_q, in) \in R''_2$. If there are no more copies of $q$ in
membrane 2, then we have to use the rule $(\#_1, out; c_q, in) \in R''_2$,
which kills the computation. It is clear that the simulation is correct and can
be iterated since we have again a state in membrane 1.

The simulation of an instruction {p —>■ q, —c) is performed in the fol- 
lowing manner. The state p is exchanged in this case with by the rule 
(p,out;c'g,in) G R'{. The object is responsible of decreasing the counter c 
and then moving the automaton into state q. To do this will go through the 
membrane structure up to membrane 4 by using the rules (rii, out; c'g, in) G R'2, 
{c'g, in) G R's, and {da,q, out; c'g, in) G R'(. When entering membrane 2 , it starts a 
“timer” in membrane 1 and when entering membrane 4 it brings out the symbol 
which will perform the actual decrementing of the counter c. 

The next step of the computation involves membrane 3 , by means of the rule 
{dc^q,out;Ca,in) G R's, which is effectively decreasing the content of counter 
c. If no more copies of Cq. are present in membrane 2 , then dc^q will sit in 
membrane 3 until the object n, the timer, reaches the subscript 4 and then 
^3 is brought in killing the computation, by means of the following rules from 
i?2- (n2,out; ni,in), {ns, out; n2, in), (ri4, out; ns, in), (#1, out; 714, in). If there is 
at least one copy of Cq, in membrane 2 , then we can apply {dc,q, out; 774, in) G R'2 
and then we finish the simulation by bringing q in membrane 1 by means of 
{dc^q,out;q,in) G R'{. If dc,q was not present in membrane 4 , then #3 will be 
released from membrane 4 by (#3, OTxt; c^, 777) G R'^. It is clear that also these 
instructions are correctly simulated by our system, and also the process can be 
iterated. 

It remains to discuss the case of rules {p ^ q,c = 0 ) from the counter 
automaton. The state p is replaced by c" by {p,out;c'g,in) G R'{, then this 
symbol will start to increment the subscripts of i when entering membrane 2: 
{ii, out; c", in) G R'2, at the next step the subscript of i is incremented in mem- 
brane 1 and also 73 is pushed in membrane 2 by means of {is, out; c'g,in) G R's- 
At the next step the special marker d'^ ^ is brought out of membrane 4 by 
means {d'^g,out;Cg,in) G R'^ and the subscript of i is still incremented by 
(t3, out; Z2, in) G i?2 • Now performs the checking for the counter c (whether 
it is zero or not): if there is at least one Ca present, then ^ will enter mem- 
brane 2, and at the next step will bring ^3 from membrane 1 since the subscript 
of i did not reach position 5; on the other hand, if there are no copies of c in 
membrane 2, then d'^ ,j will sit unused in membrane 3 for one step until is is 
brought from membrane 1 by {i 4 , out; is, in) G i?2^ then we apply the follow- 
ing rules: {is, out; i 4 , in) G R '2 and {d'^ ,^, out; is, in) G R's- Next we can apply 
{d'^ out; is, in) G R'^ and then in membrane 1 we finish the simulation by us- 
ing {d'^ q,out;q,in) G R". One can notice that all the symbols are in the same 
place as they were in the beginning of this simulation {is is back in membrane 
3, Zi,i2,*4)*5 are in membrane 2, etc.), the only symbols moved are one copy of 
d{ „ which is now in the environment and c{' which is in membrane 4. It is clear 
that we can iterate the process described above for all the types of rules in the 
counter automaton, so we correctly simulate the automaton. 

The third phase, the finishing one, will stop the simulation and move the
relevant objects into the output membrane. Specifically, when we reach a state
$p \in F$ we can use the following rules:

$R'''_1 = \{(p, out; h, in) \mid p \in F\}$,

$R'''_2 = \{(h, in)\}$,

$R'''_3 = \emptyset$,

$R'''_4 = \emptyset$,

$R'''_5 = \{(h', out; h, in), (h'', out; h, in), (h, in)\} \cup \{(h, out; c_{out,\alpha}, in) \mid \alpha \in Q\}$,

$R'''_6 = \{(c_{out,\alpha}, in) \mid \alpha \in Q\}$.

We first use $(p, out; h, in) \in R'''_1$, then $h$ enters membrane 2 by
$(h, in)$ and at the next step $h$ stops the oscillation of $h'$ and $h''$ by
putting them together in membrane 2 by means of $(h', out; h, in) \in R'''_5$
or $(h'', out; h, in) \in R'''_5$. After this $h$ begins moving the content of
the output counter $c_{out}$ into membrane 5 by using
$(h, out; c_{out,\alpha}, in) \in R'''_5$. When the last $c_{out,\alpha}$
enters membrane 6 by using $(c_{out,\alpha}, in) \in R'''_6$ the system will be
in a halting state only if a correct simulation was done in phases one and two,
so the counter automaton was correctly simulated. This completes the proof.

5 Final Remarks 

One can notice that membrane 6 was used only to collect the output. The same 
system without membrane 6 will simulate in the same way the counter automa- 
ton, but, when reaching the halt state will also contain the symbol h in the 
output membrane 5. This suggests that it could be possible to use a similar 
construct to improve the result from Theorem 3 to a result of the form: 

Conjecture: $NOP_5(sym_1, anti_1) = RE$.

Obviously, P systems with minimal symport/antiport rules and using only
one membrane can compute at most finite sets of numbers, at most as large
as the number of objects present in the system in the initial configuration: the
antiport rules do not increase the number of objects present in the system, the
same with the symport rules of the form $(a, out)$, while a symport rule of the
form $(a, in)$ should have $a \in V - E$ (otherwise the computation never stops,
because the environment is inexhaustible).

The family $NOP_2(sym_1, anti_1)$ contains infinite sets of numbers. Consider,
for instance, the system

n = ({a,6}, [J 2 {6},i?i,i?2,2), 

$R_1 = \{(a, out; b, in), (a, in)\}$,

$R_2 = \{(a, in), (b, in)\}$.

After bringing an arbitrary number of copies of b from the environment, the 
object a gets “hidden” in membrane 2, the output one. 

An estimation of the size of the families $NOP_m(sym_1, anti_1)$ for
$m = 2, 3, 4, 5$ remains to be found.

The P systems with symport and antiport rules are interesting from several
points of view: they have a precise biological inspiration, they are
mathematically elegant, the computation is done only by communication, by
moving objects through membranes (hence the conservation law is observed), and
they are computationally complete. Thus, they deserve further investigation,
including from the points of view mentioned above.

Abstract. We are concerned with the problem of checking the structure
freeness of $S^+$ for a given set $S$ of DNA sequences. It is still open whether
or not there exists an efficient algorithm for this problem. In this paper,
we will give an efficient algorithm to check the structure freeness of $S^+$
under the constraint that every sequence may form only linear secondary
structures, which partially solves the open problem.



1 Introduction 

In Adleman's pioneering work on the biomolecular experimental solution to
the directed Hamiltonian path problem ([1]) and in many other works involving
wet lab experiments performed afterward, it has been recognized that the way
information is encoded on DNA sequences is very important for guaranteeing
the reliability of those encoded DNA sequences and avoiding mishybridization.
The problem of finding a good set of DNA sequences to use for computing is
called the DNA sequence design problem. In spite of the importance of this
problem, it seems that only rather recently have research efforts been devoted
to developing systematic methods for solving it. An excellent survey of DNA
sequence design issues can be found in [6].

Being currently engaged in a research activity called molecular programming 
project in Japan, we are aiming as a final goal at establishing a systematic 
methodology for embodying desired molecular solutions within the molecular 
programming paradigm in which designing not only DNA sequences but also 
molecular reaction sequences of molecular machines are targeted as major goals 
([12]). Here, by designing DNA sequences we mean a broader goal than the one 
mentioned above, e.g., the DNA sequence design may generally deal with the 
inverse folding problem, too, the problem of designing sequences so as to fold 
themselves into intended structural molecules. 



N. Jonoska et al. (Eds.): Molecular Computing (Head Festschrift), LNCS 2950, pp. 266—277, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

Taking a backward glance at the research area of DNA sequence design,
there are already several works that employ some variants of Hamming distance
between sequences and propose methods to minimize the similarity between
sequences based on that measure ([8], [11]). Recently, Arita proposed a new
design technique, called the template method ([3]), which enables us to obtain
a very large pool of sequences with a uniform rate of GC content such that any
pair of the sequences is guaranteed to have a fairly large mismatch distance
(in the presence of the shift operation). The success of the template method
is due to the use of the sophisticated theory of error correcting codes.
However, design methods based on the variants of Hamming distance do not take
into consideration the possibility of forming secondary structures such as
internal or bulge loops. Therefore, it would be very useful if we could devise
a method for extracting a structure free subset from a given set of sequences,
where structure freeness of the sequence set is defined as the property that
the sequences in that set do not form stable secondary structures. (The notion
of structure freeness will be defined precisely in Section 2.3.) In order to
solve this extraction problem, it is of crucial importance to devise an
efficient algorithm to test the structure freeness of a given set of sequences.
This motivated us to focus on this structure freeness test problem in this
paper. There are in fact some works that investigated structure formation in
the context of the sequence design problem ([10], [15], [16]). An essential
principle (or feature) commonly used in these papers involves the statistical
thermodynamic theory of DNA molecules to compute a hybridization error rate.

The present paper focuses on giving a necessary and sufficient condition to
guarantee the global structure freeness of the whole set of sequences. More
formally, we are interested in the structure freeness of $S^+$, where $S$ is
the set of sequences to be designed, and $S^+$ is the set of sequences obtained
by concatenating the elements of $S$ finitely many times. Concerning the
structure freeness of $S^+$, Andronescu et al. proposed a method for testing
whether $S^m$ is structure-free or not, where $m$ is a positive integer or $+$
([2]). They gave a polynomial time algorithm for the case that $m$ is a
positive integer, but the proposed algorithm for the case of $m = +$ runs in
exponential time. This leaves it open whether or not there exists an efficient
algorithm for testing the structure freeness of $S^+$. In this paper, based on
the idea of reducing the test problem to a classical shortest-path problem on a
directed graph, we present an efficient algorithm for testing the structure
freeness of $S^+$ under the condition that every sequence in $S^+$ may form
only linear secondary structures, which partially solves the open problem posed
in [2].

2 Preliminaries 

Let $\Sigma$ be an alphabet {A, C, G, T} or {A, C, G, U}. A letter in $\Sigma$
is also called a base in this paper. Furthermore, a string is regarded as a
base sequence ordered from the 5'-end to the 3'-end. Consider a string $\alpha$
over $\Sigma$. By $|\alpha|$ we denote the length of $\alpha$. On the other
hand, for a set $X$, by $|X|$ we denote the cardinality of the set $X$. For
integers $i, j$ such that $1 \le i \le j \le |\alpha|$, by $\alpha[i,j]$ we
denote the substring of $\alpha$ starting from the $i$th letter and ending at
the $j$th letter of $\alpha$. In case of $i = j$, we simply write $\alpha[i]$.
In case of $i > j$, $\alpha[i,j]$ represents a null string.
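The substring notation above is 1-indexed and inclusive at both ends, which differs from the slicing convention of most programming languages. A minimal sketch of the correspondence follows; the function name `substring` is our own illustrative choice and does not appear in the original text.

```python
def substring(alpha: str, i: int, j: int) -> str:
    """Return alpha[i, j] in the 1-indexed, inclusive notation of the text."""
    if i > j:
        # By convention, alpha[i, j] is the null string when i > j.
        return ""
    if not (1 <= i and j <= len(alpha)):
        raise ValueError("indices must satisfy 1 <= i <= j <= |alpha|")
    # Shift the start by one; the inclusive end j equals Python's exclusive end.
    return alpha[i - 1:j]
```

For example, `substring("ACGT", 2, 3)` yields the two middle bases, and `substring("ACGT", 3, 3)` is the single base written $\alpha[3]$ in the text.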



2.1 Secondary Structure 

We will mainly follow the terminology and notation used in [17]. Let us
introduce the Watson-Crick complementarity function
$\theta : \Sigma \to \Sigma$ defined by $\theta(A) = T$, $\theta(T) = A$,
$\theta(C) = G$, and $\theta(G) = C$. A hydrogen bond between the $i$th base
and the $j$th base of $\alpha$ is denoted by $(i,j)$, and called a base pair.
A hydrogen bond $(i,j)$ can be formed only if $\theta(\alpha[i]) = \alpha[j]$.
Without loss of generality, we may always assume that $i < j$ for a base pair
$(i,j)$. A secondary structure of $\alpha$ is a finite set of such base pairs
of $\alpha$. A string $\alpha$ together with its secondary structure $T$ is
called a structured string, and written $\alpha(T)$. For representing the
$i$th base in $\alpha(T)$, we often use the integer $i$.
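As a small illustration of the pairing condition $\theta(\alpha[i]) = \alpha[j]$, the following sketch checks whether a set of index pairs is admissible for a DNA string. The names `THETA`, `can_pair` and `is_secondary_structure` are hypothetical helpers introduced only for this example, and the check covers only the complementarity condition stated above.

```python
# Watson-Crick complementarity function theta for the DNA alphabet.
THETA = {"A": "T", "T": "A", "C": "G", "G": "C"}


def can_pair(alpha: str, i: int, j: int) -> bool:
    """True if bases i and j of alpha (1-indexed, i < j) may form a base pair."""
    if not (1 <= i < j <= len(alpha)):
        return False
    return THETA[alpha[i - 1]] == alpha[j - 1]


def is_secondary_structure(alpha: str, pairs) -> bool:
    """True if every (i, j) in pairs satisfies theta(alpha[i]) = alpha[j]."""
    return all(can_pair(alpha, i, j) for i, j in pairs)
```

For RNA strings one would replace the entries for T by entries for U in the table.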

For three bases $i$, $j$, and $r$ in $\alpha(T)$, we say that $i$ and $j$
surround $r$ if $i < r < j$. In case of $(i,j) \in T$, we can also say that the
base pair $(i,j)$ surrounds $r$. For two base pairs $(i,j)$ and $(p,q)$, we say
that $(i,j)$ surrounds $(p,q)$, written $(p,q) < (i,j)$ or $(i,j) > (p,q)$, if
$(i,j)$ surrounds both $p$ and $q$. In this paper, we consider only secondary
structures which do not contain pseudo-knots, multiple loops, or parallel
concatenation of hairpin structures. More formally, we consider only secondary
structures $T$ such that the base pairs of $T$ can be linearly ordered with
respect to the relation $<$. Such secondary structures are said to be linear.
In the sequel, we will assume that every secondary structure is linear.
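The linearity condition can be checked mechanically. The sketch below, with the hypothetical name `is_linear`, sorts the base pairs by their first index and verifies that each pair strictly surrounds the next one; since the surround relation is transitive, checking consecutive pairs is enough to establish a linear order.

```python
def is_linear(pairs) -> bool:
    """True if the base pairs can be linearly ordered by the surround relation."""
    ordered = sorted(pairs)  # ascending first index: outermost candidate first
    # Every base pair (i, j) must satisfy i < j.
    if any(i >= j for i, j in ordered):
        return False
    # (p, q) < (i, j) holds exactly when i < p and q < j (given p < q).
    return all(
        i < p and q < j
        for (i, j), (p, q) in zip(ordered, ordered[1:])
    )
```

Two hairpins side by side, such as the pairs (1, 4) and (5, 8), and a crossing pattern such as (1, 5) and (3, 8) are both rejected, which matches the exclusion of parallel hairpins and pseudo-knots.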

A base pair (p, q) or an unpaired base r is said to be accessible from a base 
pair (i,j), if it is surrounded by (i,j) and is not surrounded by any base pair 
(k,l) such that (k,l) < (i,j). 

For each pair $bp = (i,j) \in T$, we define a cycle $c(bp)$ as a substructure
consisting of the pair $(i,j)$ together with any pairs $(p_1,q_1)$,
$(p_2,q_2)$, ... accessible from $(i,j)$ and any unpaired bases accessible from
$(i,j)$. If $c(bp)$ contains $k$ pairs (including the pair $(i,j)$), it is
called a $k$-cycle or a cycle of order $k$. Since we consider only linear
secondary structures, every cycle contained in $\alpha(T)$ is either a 1-cycle
or a 2-cycle.

A cycle of order $k$ defined by $(i,j)$ is classified as follows (see also
Figure 1):

1. In case of $k = 1$, it is called a hairpin.

2. In case that $k = 2$ and the accessible pair is $(i+1, j-1)$, it is called a
stacked pair.

3. In case that $k = 2$ and the accessible pair is either $(i+1, p)$ or
$(p, j-1)$ for some $i < p < j$, it is called a bulge loop.

4. In case that $k = 2$ but none of the conditions above is satisfied, it is
called an internal loop.
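The four cases can be summarized in a short sketch. The function name `classify_cycle` and the string labels are our own illustrative choices; the stacked pair case is tested before the bulge loop case, following the order of the list above.

```python
def classify_cycle(outer, inner=None) -> str:
    """Classify the cycle defined by base pair `outer`.

    `inner` is the single accessible base pair of a 2-cycle,
    or None for a 1-cycle.
    """
    i, j = outer
    if inner is None:
        return "hairpin"
    p, q = inner
    if not (i < p < q < j):
        raise ValueError("the inner pair must be surrounded by the outer pair")
    if (p, q) == (i + 1, j - 1):
        return "stacked pair"
    # Exactly one side of the inner pair is adjacent to the outer pair.
    if p == i + 1 or q == j - 1:
        return "bulge loop"
    return "internal loop"
```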

4 For the case of RNA strings, replace the letter T by U.

The loop length of a hairpin $c$ with a base pair $(i,j)$ is defined as the
number of unpaired bases $j - i - 1$. The loop length of a bulge loop or an
internal loop with base pairs $(i,j)$ and $(p,q)$ ($(p,q) < (i,j)$) is also
defined as the number of unpaired bases $p - i + j - q - 2$. The loop length
mismatch of an internal loop with base pairs $(i,j)$ and $(p,q)$
($(p,q) < (i,j)$) is defined as $|(p - i) - (j - q)|$.
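The three quantities just defined are simple index arithmetic. A minimal sketch, with hypothetical function names, is given here so that the formulas can be checked on small examples.

```python
def hairpin_loop_length(i: int, j: int) -> int:
    """Number of unpaired bases enclosed by the hairpin base pair (i, j)."""
    return j - i - 1


def two_cycle_loop_length(i: int, j: int, p: int, q: int) -> int:
    """Number of unpaired bases between the pairs (i, j) and (p, q)."""
    return (p - i) + (j - q) - 2


def loop_length_mismatch(i: int, j: int, p: int, q: int) -> int:
    """Difference between the two sides of the loop, as an absolute value."""
    return abs((p - i) - (j - q))
```

For the pairs (1, 10) and (3, 7), the unpaired bases are 2 on one side and 8, 9 on the other, so the loop length is 3 and the mismatch is 1.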

Let $(i,j)$ be the element in $T$ such that for any $(p,q) \in T$ with
$(p,q) \ne (i,j)$, $(p,q) < (i,j)$ holds. Since $T$ is linear, such an element
is determined uniquely. Then, the substructure of $\alpha(T)$ consisting of the
pair $(i,j)$ and the unpaired bases which are not surrounded by $(i,j)$ is
called a free end structure of $\alpha(T)$.

By I a) we denote a hairpin consisting of a sequence a with a base pair 



between a[l] and o;[|a|]. By 



we denote a 2 -cycle consisting of two sequences 



a and j3 with base pairs between o;[l] and 



and between a[|o;|] and j3[V\. 



By 13 



we denote a free end structure consisting of two sequences a and l3 with 



a base pair between a[|a|] and (3[1]. 



2.2 Free Energy Calculation 

A free energy value is assigned to each cycle or free end structure.
Experimental evidence is used to determine such free energy values. The method
for assigning free energy values is given as follows (see Figure 1):

1. The free energy $E(c)$ of a hairpin $c$ with a base pair $(i,j)$ is
dependent on the base pair $(i,j)$, the bases $i+1$, $j-1$ adjacent to $(i,j)$,
and its loop length $l$:

$E(c) = h_1(\alpha[i], \alpha[j], \alpha[i+1], \alpha[j-1]) + h_2(l)$,

where $h_1$, $h_2$ are experimentally obtained functions. The function $h_2$ is
positive and there exists a constant $L$ such that for the range $l > L$, $h_2$
is weakly monotonically increasing.

2. The free energy $E(c)$ of a stacked pair $c$ with base pairs $(i,j)$ and
$(i+1, j-1)$ is dependent on the base pairs $(i,j)$, $(i+1, j-1)$:

$E(c) = s_1(\alpha[i], \alpha[j], \alpha[i+1], \alpha[j-1])$,

where $s_1$ is an experimentally obtained function.

3. The free energy $E(c)$ of a bulge loop $c$ with base pairs $(i,j)$ and
$(p,q)$ ($(p,q) < (i,j)$) is dependent on the base pairs $(i,j)$, $(p,q)$, and
its loop length $l$:

$E(c) = b_1(\alpha[i], \alpha[j], \alpha[p], \alpha[q]) + b_2(l)$,

where $b_1$, $b_2$ are experimentally obtained functions. The function $b_2$ is
positive and weakly monotonically increasing.

5 The calculation method presented here is in a general form so that it can be
specialized to be equivalent to the one used in standard RNA packages such as
ViennaRNA ([13]), mfold ([18]), etc.

4. The free energy $E(c)$ of an internal loop $c$ with base pairs $(i,j)$ and
$(p,q)$ ($(p,q) < (i,j)$) is dependent on the base pairs $(i,j)$, $(p,q)$, the
bases $i+1$, $j-1$ adjacent to $(i,j)$, the bases $p-1$, $q+1$ adjacent to
$(p,q)$, its loop length $l$, and its loop length mismatch $d$:

$E(c) = e_1(\alpha[i], \alpha[j], \alpha[p], \alpha[q], \alpha[i+1], \alpha[j-1], \alpha[p-1], \alpha[q+1]) + e_2(l) + e_3(d)$,

where $e_1$, $e_2$, $e_3$ are experimentally obtained functions. The functions
$e_2$, $e_3$ are positive and weakly monotonically increasing.

5. The free energy $E(c)$ of a free end structure $c$ with a base pair $(i,j)$
is dependent on the base pair $(i,j)$ and the bases $i-1$, $j+1$ adjacent to
$(i,j)$:

$E(c) = d_1(\alpha[i], \alpha[j], \alpha[i-1], \alpha[j+1])$,

where $d_1$ is an experimentally obtained function.

We assume that all functions $h_1$, $h_2$, $s_1$, $b_1$, $b_2$, $e_1$, $e_2$,
$e_3$, $d_1$ are computable in constant time.

Let $c_1, \dots, c_k$ be the cycles contained in $\alpha(T)$, and $c_0$ be a
free end structure of $\alpha(T)$. Then, the free energy $E(\alpha(T))$ of
$\alpha(T)$ is given by:

$E(\alpha(T)) = \sum_{i=0}^{k} E(c_i)$.

There exist efficient algorithms for prediction of secondary structures ([17],
[18]), which is due to the fact that the free energy of a structured string is
defined as the sum of the free energies of the cycles and free end structures
contained in it. This fact also plays an important role in constructing the
algorithm proposed in this paper. Furthermore, the property of the energy
functions that $h_2$, $b_2$, $e_2$, and $e_3$ are weakly monotonically
increasing is important for the correctness of the proposed algorithm. In
particular, this property plays an important role in Theorem 1.
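The additive form of the free energy can be illustrated with a short sketch. It assumes a nonempty linear structure and takes the three kinds of energy terms as caller-supplied functions; the name `total_free_energy` and its parameters are hypothetical, and no experimentally obtained values are implied.

```python
def total_free_energy(pairs, free_end_energy, two_cycle_energy, hairpin_energy):
    """Sum the energies of the free end structure and of all cycles.

    pairs            -- base pairs of a nonempty linear secondary structure
    free_end_energy  -- function of the outermost base pair
    two_cycle_energy -- function of (outer pair, accessible inner pair)
    hairpin_energy   -- function of the innermost base pair
    """
    ordered = sorted(pairs)  # for a linear structure: outermost pair first
    if not ordered:
        raise ValueError("this sketch expects at least one base pair")
    total = free_end_energy(ordered[0])
    # Each consecutive pair of base pairs bounds exactly one 2-cycle.
    for outer, inner in zip(ordered, ordered[1:]):
        total += two_cycle_energy(outer, inner)
    # The innermost base pair closes the only 1-cycle (the hairpin).
    total += hairpin_energy(ordered[-1])
    return total
```

With toy constants 1.0 for the free end, -2.0 for each 2-cycle and 3.0 for the hairpin, a structure with three nested base pairs has total energy 1.0 - 2.0 - 2.0 + 3.0 = 0.0.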

2.3 Structure Freeness Test Problem

In this paper, we will consider the problem of testing whether a given finite set 
of strings of the same length n is structure free or not. The problem is formally 
stated as follows: 

Input: A finite set $S$ of strings of the same length $n$

Output: Answer "yes" if for any structured string $\alpha(T)$ such that
$\alpha \in S^+$ and $T$ is linear, $E(\alpha(T)) > 0$ holds, otherwise, answer
"no".

We will give an efficient algorithm for solving this problem. 

3 Configuration of Structured Strings 

Fig. 1. Basic Secondary Structures: (a) Hairpin, (b) Free End, (c) Internal
Loop (Bulge Loop in case of $i + 1 = p$ or $j = q + 1$; Stacked Pair in case of
$i + 1 = p$ and $j = q + 1$)

Let $S$ be a finite set of strings of length $n$ over $\Sigma$. Let
$\alpha(T)$ be a structured string such that $\alpha \in S^+$. Let us consider
a base pair $(i,j) \in T$ of $\alpha(T)$. Let $\beta, \gamma \in S$. Then,
$(i,j)$ is said to have configuration $(\beta, k, \gamma, l)$ in $\alpha(T)$ if
the $i$th and $j$th bases of $\alpha$ correspond to the $k$th base of the
segment $\beta$ and the $l$th base of the segment $\gamma$, respectively. More
formally, $(i,j)$ is said to have configuration $(\beta, k, \gamma, l)$ in
$\alpha(T)$ if there exist $x, y, z, w \in S^*$ such that
$\alpha = x\beta y = z\gamma w$, $|x| < i = |x| + k \le |x\beta|$ and
$|z| < j = |z| + l \le |z\gamma|$. For a base pair $bp$ in $\alpha(T)$, by
$cf(bp)$ we denote the configuration of $bp$. For a configuration
$(\beta, k, \gamma, l)$, we prefer to use the following representation:

$\begin{pmatrix} \beta\,(k) \\ \gamma\,(l) \end{pmatrix}$

that is graphically more appealing to the reader.

The configuration $cf(\alpha(T))$ of $\alpha(T)$, where
$T = \{bp_1, \dots, bp_m\}$ and $bp_1 > \cdots > bp_m$, is defined as the
sequence $cf(bp_1), \dots, cf(bp_m)$. For two structured strings
$\alpha_1(T_1)$ and $\alpha_2(T_2)$, we write
$\alpha_1(T_1) \equiv \alpha_2(T_2)$ if
$cf(\alpha_1(T_1)) = cf(\alpha_2(T_2))$.

A structured string $\alpha(T)$ is said to be E-minimal if for any
$\alpha'(T')$ such that $\alpha(T) \equiv \alpha'(T')$,
$E(\alpha(T)) \le E(\alpha'(T'))$ holds. The existence of such an E-minimal
structured string is not clear at this point. But we can show that for any
configuration $C$, there always exists an E-minimal structured string
$\alpha(T)$ such that $cf(\alpha(T)) = C$, which will be implicitly proved in
the proof of (2)→(1) in Theorem 2.

A 2-cycle $c$ with base pairs $bp_1, bp_2$ ($bp_2 < bp_1$) in $\alpha(T)$ is
said to have boundary configuration $(v_1, v_2)$ if $cf(bp_i) = v_i$
($i = 1, 2$). A 1-cycle or a free end structure $c$ with a base pair $bp$ in
$\alpha(T)$ is said to have the boundary configuration $v$ if $cf(bp) = v$.

4 Minimum Free Energy Under Boundary Constraints

Let $S$ be a finite set of strings of length $n$ over $\Sigma$ and consider the
configurations

$v_1 = \begin{pmatrix} \alpha\,(i) \\ \beta\,(j) \end{pmatrix}$ and
$v_2 = \begin{pmatrix} \gamma\,(k) \\ \delta\,(l) \end{pmatrix}$,

where $\alpha, \beta, \gamma, \delta \in S$. We define $minI(v_1, v_2)$ as the
minimum free energy of 2-cycles among all 2-cycles with the boundary
configuration $(v_1, v_2)$ in a structured string $x(T)$ with $x$ ranging over
$S^+$. By $minH(v_1)$ we denote the minimum free energy of 1-cycles among all
1-cycles with the boundary configuration $v_1$ in a structured string $x(T)$
with $x$ ranging over $S^+$. The notation $minD(v_1)$ is defined to be the
minimum free energy of free end structures among all free end structures with
the boundary configuration $v_1$ in a structured string $x(T)$ with $x$ ranging
over $S^+$.

For a string $\alpha \in S$ and an integer $i$ with $1 \le i \le n$, we
introduce the notation $\alpha{:}i$ which indicates the $i$th position of the
string $\alpha$.

Then, for a set $S$, we define:

$V(S) = \{\alpha{:}i \mid \alpha \in S, 1 \le i \le n\}$.

Consider two elements $\alpha{:}i$, $\beta{:}j$ in $V(S)$.

(1) In case that either $\alpha \ne \beta$ or $i \ge j$ holds, we define:

$W(\alpha{:}i, \beta{:}j) = \{\alpha[i,n]\,\gamma\,\beta[1,j] \mid \gamma \in S^*\}$.

(2) In case that $\alpha = \beta$ and $i < j$ hold, we define:

$W(\alpha{:}i, \beta{:}j) = \{\alpha[i,n]\,\gamma\,\beta[1,j] \mid \gamma \in S^*\} \cup \{\alpha[i,j]\}$.

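Theorem 1. For configurations
$v_1 = \begin{pmatrix} \alpha\,(i) \\ \beta\,(j) \end{pmatrix}$ and
$v_2 = \begin{pmatrix} \gamma\,(k) \\ \delta\,(l) \end{pmatrix}$ with
$\theta(\alpha[i]) = \beta[j]$, $\theta(\gamma[k]) = \delta[l]$, where
$\alpha, \beta, \gamma, \delta \in S$, the following equations hold:

(1) $minH(v_1) = \min \{ E(|x\rangle) \mid x \in W(\alpha{:}i, \beta{:}j) \}$,

(2) $minI(v_1, v_2) = \min \{ E(c) \mid c$ is the 2-cycle consisting of $x$ and
$y$, $x \in W(\alpha{:}i, \gamma{:}k)$, $y \in W(\delta{:}l, \beta{:}j) \}$,

(3) $minD(v_1) = \min \{ E(c) \mid c$ is the free end structure consisting of
$x\alpha[1,i]$ and $\beta[j,n]y$, $x, y \in S^* \}$.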

Furthermore, $minH(v_1)$, $minI(v_1, v_2)$, and $minD(v_1)$ are computable in
$O(|S|)$ time.

Proof. (1) The set of hairpins $|x\rangle$ such that
$x \in W(\alpha{:}i, \beta{:}j)$ corresponds to the set of all hairpins which
could have the boundary configuration $v_1$. However, since there exist
infinitely many such $x$'s, it is not clear whether the minimum of
$E(|x\rangle)$ exists or not.

Recall that $n$ is the length of each string in $S$ and $L$ is the constant
introduced in Section 2.2. Let $x \in W(\alpha{:}i, \beta{:}j)$ such that
$|x| > L + 5n$. Since $|x| > 5n$, we can write
$x = \alpha[i,n]\,\alpha'\gamma\beta'\,\beta[1,j]$ for some
$\alpha', \beta' \in S^+$ and $\gamma \in S$. Then, we have that
$x' = \alpha[i,n]\,\alpha'\beta'\,\beta[1,j]$ is also in
$W(\alpha{:}i, \beta{:}j)$ and that $|x| > |x'| > L$. Since the values of $h_1$
for $|x\rangle$ and $|x'\rangle$ are equal to each other and
$h_2(|x|) \ge h_2(|x'|)$ holds (recall that $h_2$ is weakly monotonically
increasing), we have $E(|x'\rangle) \le E(|x\rangle)$. Therefore, we can
conclude that $E(|x\rangle)$ takes the minimum value for some
$x \in W(\alpha{:}i, \beta{:}j)$ with $|x| \le L + 5n$. Thus, the first
equation holds. Note that the value of $E(|x\rangle)$ depends only on the loop
length and the four bases $x[1]$, $x[2]$, $x[|x|-1]$, $x[|x|]$. Since, for each
loop length, all possible combinations of those four bases can be computed in
$O(|S|)$ time, $minH(v_1)$ can also be computed in $O(|S|)$ time.



(2) The set of 2-cycles consisting of $x$ and $y$ such that
$x \in W(\alpha{:}i, \gamma{:}k)$ and $y \in W(\delta{:}l, \beta{:}j)$
corresponds to the set of all 2-cycles which could have boundary configuration
$(v_1, v_2)$. However, since there exist infinitely many such $x$'s and $y$'s,
it is not clear whether the minimum of the free energy of such 2-cycles exists
or not.

Let $x \in W(\alpha{:}i, \gamma{:}k)$ and $y \in W(\delta{:}l, \beta{:}j)$.
Suppose that $|x| > 6n$. Since $|x| > 6n$, we can write
$x = \alpha[i,n]\,\alpha'\eta\gamma'\,\gamma[1,k]$ for some
$\alpha', \gamma' \in S^+$ and $\eta \in S$. Then, we have that
$x' = \alpha[i,n]\,\alpha'\gamma'\,\gamma[1,k]$ is also in
$W(\alpha{:}i, \gamma{:}k)$ and that $|x| > |x'| > 5n$. If $|y| \le 5n$, set
$y' = y$. Otherwise, we can write
$y = \delta[l,n]\,\delta'\rho\beta'\,\beta[1,j]$ for some
$\delta', \beta' \in S^+$ and $\rho \in S$, and set
$y' = \delta[l,n]\,\delta'\beta'\,\beta[1,j]$. Then, we have
$y' \in W(\delta{:}l, \beta{:}j)$. Note that the loop length of the 2-cycle
consisting of $x$ and $y$ is greater than that of the 2-cycle consisting of
$x'$ and $y'$. Further, note that the loop length mismatch of the former is
greater than or equal to that of the latter. Since the values of $s_1$, $b_1$
and $e_1$ for the former are equal to those for the latter, respectively, the
free energy of the 2-cycle consisting of $x'$ and $y'$ is at most that of the
2-cycle consisting of $x$ and $y$ (recall that $b_2$, $e_2$ and $e_3$ are
weakly monotonically increasing).

In case that $|y| > 6n$, in a similar manner, we can show the existence of
$x' \in W(\alpha{:}i, \gamma{:}k)$ and $y' \in W(\delta{:}l, \beta{:}j)$ such
that the free energy of the 2-cycle consisting of $x'$ and $y'$ is at most that
of the 2-cycle consisting of $x$ and $y$.

Therefore, we can conclude that the free energy takes the minimum value for
some $x \in W(\alpha{:}i, \gamma{:}k)$ and $y \in W(\delta{:}l, \beta{:}j)$
such that $|x| \le 6n$ and $|y| \le 6n$. Thus, the second equation holds. Note
that the free energy of the 2-cycle consisting of $x$ and $y$ depends only on
the loop length, the loop length mismatch, and the eight bases $x[1]$, $x[2]$,
$x[|x|-1]$, $x[|x|]$, $y[1]$, $y[2]$, $y[|y|-1]$, $y[|y|]$. Since, for each
loop length and loop length mismatch, all possible combinations of those eight
bases can be computed in $O(|S|)$ time, $minI(v_1, v_2)$ can also be computed
in $O(|S|)$ time.



(3) The set of free end structures consisting of $x\alpha[1,i]$ and
$\beta[j,n]y$ such that $x, y \in S^*$ corresponds to the set of all free end
structures which could have boundary configuration $v_1$. It is clear that the
free energy of such a structure takes the minimum value for some
$x, y \in S^*$ such that $|x|, |y| \le n$. Thus, the third equation holds. Let
$s = x\alpha[1,i]$ and $t = \beta[j,n]y$. Then, note that the free energy of
the free end structure depends only on the four bases $s[|s|-1]$, $s[|s|]$,
$t[1]$, $t[2]$. Since for fixed lengths of $s$ and $t$, all possible
combinations of those four bases can be computed in $O(|S|)$ time, $minD(v_1)$
can also be computed in $O(|S|)$ time. □



5 Algorithm for Testing the Structure Freeness 

For a set $S$ of strings, we construct a weighted directed graph
$G(S) = (V, E, w)$, called the Hydrogen Bond Network graph (HBN graph) of the
set $S$, where $V$ and $E$ are defined as follows:

$V = V' \cup \{d, h\}$,

$V' = \left\{ \begin{pmatrix} \alpha\,(i) \\ \beta\,(j) \end{pmatrix} \;\middle|\; \alpha, \beta \in S,\ \theta(\alpha[i]) = \beta[j],\ 1 \le i, j \le n \right\}$,

$E = (V' \times V') \cup (\{d\} \times V') \cup (V' \times \{h\})$.

Furthermore, the weight function $w$ is defined as follows:

(1) for $v_1, v_2 \in V'$, we define:

$w((v_1, v_2)) = minI(v_1, v_2)$,

(2) for $v \in V'$, we define:

$w((d, v)) = minD(v)$,

(3) for $v \in V'$, we define:

$w((v, h)) = minH(v)$.
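To make the size and shape of the HBN graph concrete, the following sketch enumerates the vertex set $V'$ and the edge set $E$ for a small DNA set. Configurations are represented as tuples `(alpha, i, beta, j)`, and the names `hbn_vertices` and `hbn_edges` are hypothetical. The weights are left out, since computing $minI$, $minD$ and $minH$ requires the experimentally obtained energy functions.

```python
from itertools import product

# Watson-Crick complementarity function theta for the DNA alphabet.
THETA = {"A": "T", "T": "A", "C": "G", "G": "C"}


def hbn_vertices(S):
    """Enumerate V': tuples (alpha, i, beta, j) with theta(alpha[i]) = beta[j]."""
    n = len(S[0])
    if any(len(s) != n for s in S):
        raise ValueError("all strings in S must have the same length")
    return [
        (alpha, i, beta, j)
        for alpha, beta in product(S, repeat=2)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if THETA[alpha[i - 1]] == beta[j - 1]
    ]


def hbn_edges(S):
    """Enumerate E = (V' x V') U ({d} x V') U (V' x {h})."""
    inner = hbn_vertices(S)
    edges = [(u, v) for u in inner for v in inner]
    edges += [("d", v) for v in inner]
    edges += [(v, "h") for v in inner]
    return edges
```

Since $|V'| \le |S|^2 n^2$, the edge set has at most $|V'|^2 + 2|V'|$ elements; for the one-string set `["AT"]` there are 2 vertices in $V'$ and 8 edges.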

For a path $p$ in $G$, by $w(p)$ we denote the sum of the weights of the edges
contained in $p$, i.e., the weight of the path $p$.

Theorem 2. For a finite set $S$ of strings of the same length and a real value
$F$, the following two statements are equivalent.

(1) There is an E-minimal structured string $\alpha(T)$ such that
$\alpha \in S^+$, $T \ne \emptyset$ and $E(\alpha(T)) = F$,

(2) There is a path $p$ from $d$ to $h$ of the graph $G(S)$ such that
$w(p) = F$.

Proof. (1)→(2): Let $\alpha(T)$ be an E-minimal structured string such that
$\alpha \in S^+$, $T \ne \emptyset$. Let $cf(\alpha(T)) = v_0, \dots, v_k$.
Then, by $c_i$ ($i = 1, \dots, k$) we denote a 2-cycle in $\alpha(T)$ with the
boundary configuration $(v_{i-1}, v_i)$, by $c_0$ we denote a free end
structure in $\alpha(T)$ with the boundary configuration $v_0$, and by
$c_{k+1}$ we denote a hairpin in $\alpha(T)$ with the boundary configuration
$v_k$. By the definition of the weight function $w$, for each
$i = 1, \dots, k$, we have $w((v_{i-1}, v_i)) = minI(v_{i-1}, v_i)$.

By the definition of $minI(v_{i-1}, v_i)$, $E(c_i) \ge minI(v_{i-1}, v_i)$
holds for each $i = 1, \dots, k$. Suppose $E(c_i) > minI(v_{i-1}, v_i)$ for
some $i$, and let $c'$ be a 2-cycle with the boundary configuration
$(v_{i-1}, v_i)$ such that $E(c') = minI(v_{i-1}, v_i)$. Then, by replacing
$c_i$ by $c'$, we will obtain a new structured string $\alpha'(T')$ such that
$\alpha(T) \equiv \alpha'(T')$ and $E(\alpha'(T')) < E(\alpha(T))$, which
contradicts the fact that $\alpha(T)$ is E-minimal. Therefore, we have
$E(c_i) = minI(v_{i-1}, v_i)$ for each $i = 1, \dots, k$. Thus, for each
$i = 1, \dots, k$, $E(c_i) = w((v_{i-1}, v_i))$ holds. In similar ways, we have
$E(c_0) = minD(v_0) = w((d, v_0))$ and
$E(c_{k+1}) = minH(v_k) = w((v_k, h))$. Therefore, the weight of the path
$d \to v_0 \to \cdots \to v_k \to h$ is equivalent to
$\sum_{i=0}^{k+1} E(c_i) = E(\alpha(T))$.

(2)→(1): Let us consider a path
$p : d \to v_0 \to \cdots \to v_k \to h$. For each $i = 1, \dots, k$, let $c_i$
be a 2-cycle with the boundary configuration $(v_{i-1}, v_i)$ such that
$E(c_i) = minI(v_{i-1}, v_i)$. Furthermore, let $c_0$ be a free end structure
with the boundary configuration $v_0$ such that $E(c_0) = minD(v_0)$ and
$c_{k+1}$ be a hairpin with the boundary configuration $v_k$ such that
$E(c_{k+1}) = minH(v_k)$. Then, we can obtain a structured string $\alpha(T)$
by concatenating $c_0, c_1, \dots, c_k, c_{k+1}$ in this order so that for each
$i = 1, \dots, k+1$, the two base pairs on the boundary between $c_{i-1}$ and
$c_i$ become a common single base pair of both $c_{i-1}$ and $c_i$. Then,
$E(\alpha(T))$ is equivalent to the weight $w(p)$ of the path $p$. Suppose that
$\alpha(T)$ is not E-minimal. Then, there exists a structured string
$\alpha'(T')$ such that $\alpha(T) \equiv \alpha'(T')$ and
$E(\alpha'(T')) < E(\alpha(T))$. Let $c'_i$ ($i = 1, \dots, k$) be a 2-cycle in
$\alpha'(T')$ with the boundary configuration $(v_{i-1}, v_i)$, $c'_0$ be a
free end structure in $\alpha'(T')$ with the boundary configuration $v_0$, and
$c'_{k+1}$ be a hairpin in $\alpha'(T')$ with the boundary configuration $v_k$.
By $E(\alpha'(T')) < E(\alpha(T))$, there exists some $i$
($0 \le i \le k+1$) such that $E(c'_i) < E(c_i)$, which contradicts the
definition of either $minD$, $minI$, or $minH$. Thus, $\alpha(T)$ is
E-minimal. □

By the above theorem, we can obtain the following simple algorithm to test 
the structure freeness of a given set S of strings of the same length: 

Input: A finite set $S$ of strings of the same length.

Output: Answer “yes” if for any structured string $\alpha(T)$ such that $\alpha \in S^+$ and
$T$ is linear, $E(\alpha(T)) \ge 0$ holds, otherwise, answer “no”.

1. Construct the HBN graph $G(S)$.

2. Apply to $G(S)$ an algorithm for the single-source shortest-paths problem in
which edge weights can be negative. If there exists no negative-weight path
from $d$ to $h$, answer “yes”. Otherwise answer “no”.

For the single-source shortest-paths problem in which edge weights can be
negative, we can use the Bellman-Ford algorithm ([4], [9]). Given a weighted and
directed graph $G = (V, E)$ with source $s$ and weight function $w : E \to \mathbb{R}$, the
Bellman-Ford algorithm returns a Boolean value indicating whether or not there
exists a negative-weight cycle that is reachable from the source. If there is such
a cycle, then the algorithm indicates that no shortest path exists. If there is no
such cycle, then the algorithm produces the weights of the shortest paths.

Note that every vertex in $G(S)$ has a path to $h$. Thus, the existence of a
negative-weight cycle reachable from $d$ implies the existence of a negative-weight
path from $d$ to $h$. So, we answer “no” if there exists a negative-weight cycle
that is reachable from the vertex $d$. Otherwise, we answer “no” if the computed
weight $W$ of the shortest path from $d$ to $h$ is negative, and “yes” if $W$ is not negative.
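The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is passed in as an explicit edge list with precomputed weights (the computation of $minI$, $minD$, $minH$ is not shown), the names `negative_path_exists` and `structure_free` are hypothetical, and the negative-cycle shortcut relies on the property noted above that every vertex has a path to $h$.

```python
from math import inf

def negative_path_exists(vertices, edges, source, target):
    """Return True if some walk from source to target has negative weight.

    edges is a list of (u, v, weight) triples. Assumes, as in G(S), that
    every vertex reachable from source has a path to target.
    """
    dist = {v: inf for v in vertices}
    dist[source] = 0.0
    # Standard Bellman-Ford: relax every edge |V| - 1 times.
    for _ in range(len(vertices) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    # A further improvement signals a negative cycle reachable from source.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return True
    # No negative cycle: dist[target] is the weight W of a shortest path.
    return dist[target] < 0

def structure_free(vertices, edges, d="d", h="h"):
    """Answer 'yes' (True) iff there is no negative-weight path from d to h."""
    return not negative_path_exists(vertices, edges, d, h)
```

For a toy graph with vertices `d, u, v, h`, the answer flips from “yes” to “no” as soon as one edge weight makes the total weight of the only path negative.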

Let $S$ be a given finite set of strings of length $n$. Let $m$ be the number of
strings in $S$. Then, the number of vertices of $G(S)$ is $O(m^2 n^2)$. Therefore, the
number of edges of $G(S)$ is $O(m^4 n^4)$. As discussed in Theorem 1, since the time
necessary for computing the weight of an edge is $O(m)$, the time necessary for
the construction of $G(S)$ is $O(m^5 n^4)$.

The time complexity of the Bellman-Ford algorithm is $O(|V||E|)$. Consequently,
the time complexity of the second step of the proposed algorithm is $O(m^6 n^6)$.
Therefore, the proposed algorithm runs in $O(m^6 n^6)$ time in total.

6 Conclusions 

In this paper, we focused on the problem of testing the structure freeness of
$S^+$ for a given set $S$ of sequences. We gave a partial answer to this problem,
and proposed an efficient algorithm to test the structure freeness of $S^+$ under
the constraint that every string may form only linear secondary structures. We
are continuing our study in order to devise an efficient algorithm for solving the
general problem in which sequences may form multiple loop structures.

Abstract. Languages of cyclic words and their relation to classical word
languages are considered. To this aim, an associative and commutative
operation on cyclic words is introduced.



1 Introduction 

Related to DNA and splicing operations (see [1,2,4,11], etc.) it is of interest to 
investigate characterizations of languages of cyclic words. 

Languages of cyclic words can be generated in various ways. One possibil- 
ity is to consider languages of classical type, like regular, linear, or context-free 
languages, and then take the collection of all equivalence classes of words with 
respect to cyclic permutation from such languages. Another possibility is to con- 
sider languages of cyclic words defined by rational, linear, and algebraic systems 
of equations via least fixed point, with respect to an underlying associative oper- 
ation on the monoid of equivalence classes. If the operation is also commutative, 
the classes of rational, linear, and algebraic languages coincide [10]. In the case 
of catenation as underlying operation for words such least fixed points give reg- 
ular, linear, and context-free languages, respectively. A third way is to define 
languages of cyclic words by the algebraic closure under some (not necessarily 
associative) operation. A fourth way to generate languages of cyclic words is 
given by rewriting systems analogous to classical grammars for words, as right 
linear, linear, context-free, monotone, and arbitrary grammars [9] . 

For all notions not defined here we refer to [5,12]. 

An associative and commutative operation on cyclic words is introduced be- 
low. It is shown that the first two ways of defining languages of cyclic words do 
not coincide. It is also shown that the classical classes of regular and context-free 
languages are closed under cyclic permutation, but the class of linear languages 
is not. 

2 Definitions 

Let $V$ be an alphabet. $\lambda$ denotes the empty word, $|x|$ the length of a word $x$, and
$|x|_a$ the number of symbols $a \in V$ in $x$.

Furthermore, let REG, LIN, CF, CS, and RE denote the classes of regu- 
lar, linear, context-free, context-sensitive, and recursively enumerable languages, 
respectively. 



N. Jonoska et al. (Eds.): Molecular Computing (Head Festschrift), LNCS 2950, pp. 278-288, 2004.

Definition 1. Define the relation $\sim\ \subseteq V^* \times V^*$ by

$$x \sim y \iff x = \alpha\beta \wedge y = \beta\alpha \text{ for some } \alpha, \beta \in V^*.$$

Lemma 1. The relation $\sim$ is an equivalence.

Proof. Trivially, $\sim$ is reflexive since $x = \lambda x = x\lambda = x$, and symmetric by
definition.

$\sim$ is also transitive since $x \sim y$, $y \sim z$ implies $x = \alpha\beta$, $y = \beta\alpha = \gamma\delta$, $z = \delta\gamma$.
If $|\beta| \le |\gamma|$ then $\gamma = \beta\rho$, $\alpha = \rho\delta$, and therefore $x = \rho\delta\beta$, $z = \delta\beta\rho$. Hence
$x \sim z$.

If $|\beta| > |\gamma|$ then $\beta = \gamma\sigma$, $\delta = \sigma\alpha$, and therefore $x = \alpha\gamma\sigma$, $z = \sigma\alpha\gamma$. Hence
$z \sim x$.

Thus $\sim$ is an equivalence relation. □



Definition 2. Denote by $[x]$ the equivalence class of $x$ consisting of all cyclic
permutations of $x$, and by $C_V = V^*/\!\sim$ the set of all equivalence classes of $\sim$.



Definition 3. For each cyclic word $[x]$ a norm may be defined in a natural way by
$\|[x]\| = |x|$. Clearly, $\|[x]\|$ is well defined since $|\xi| = |x|$ for all $\xi \in [x]$. The
norm may be extended to sets of cyclic words by

$$\|A\| = \max\{\|[x]\| \mid [x] \in A\}.$$

It is obvious from the definition that $\|\{[x]\} \circ \{[y]\}\| = \|\{[x]\}\| + \|\{[y]\}\|$, and
therefore $\|A \circ B\| \le \|A\| + \|B\|$.

The next aim is to define an associative operation on $2^{C_V}$.

Definition 4. Define an operation $\odot$ on $2^{C_V}$ as follows:

$$\{[x]\} \odot \{[y]\} = \{[\xi\eta] \mid \xi \in [x],\ \eta \in [y]\}.$$

Note that $\{[\lambda]\} \odot \{[x]\} = \{[x]\} \odot \{[\lambda]\} = \{[x]\}$.

Unfortunately, $\odot$ is only commutative but not associative.

Lemma 2. The operation $\odot$ is commutative but not associative.

Proof. $\{[x]\} \odot \{[y]\} = \{[\xi\eta] \mid \xi \in [x],\ \eta \in [y]\}
= \{[\eta\xi] \mid \xi \in [x],\ \eta \in [y]\} = \{[y]\} \odot \{[x]\}$.

Thus $\odot$ is commutative.

$$(\{[ab]\} \odot \{[c]\}) \odot \{[d]\} = \{[abc], [acb]\} \odot \{[d]\}
= \{[abcd], [adbc], [abdc], [acbd], [adcb], [acdb]\},$$

$$\{[ab]\} \odot (\{[c]\} \odot \{[d]\}) = \{[ab]\} \odot \{[cd]\} = \{[abcd], [acdb], [abdc], [adcb]\}.$$

Thus $\odot$ is not associative. □
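The counterexample in the proof of Lemma 2 can be checked mechanically. The following Python sketch is an illustration only: it represents a cyclic word by its lexicographically least rotation (a representation chosen here, not taken from the paper), extends the operation of Definition 4 to sets by taking unions, and uses the hypothetical names `canon`, `rotations`, and `odot`.

```python
def canon(word):
    """Canonical representative of the cyclic word [word]: its least rotation."""
    if not word:
        return ""
    return min(word[i:] + word[:i] for i in range(len(word)))

def rotations(word):
    """All words in the equivalence class [word]."""
    if not word:
        return {""}
    return {word[i:] + word[:i] for i in range(len(word))}

def odot(A, B):
    """Operation of Definition 4, extended to sets of cyclic words by union."""
    return {canon(xi + eta)
            for x in A for y in B
            for xi in rotations(x) for eta in rotations(y)}

# The two bracketings from the proof of Lemma 2.
left = odot(odot({canon("ab")}, {canon("c")}), {canon("d")})
right = odot({canon("ab")}, odot({canon("c")}, {canon("d")}))
```

Here `left` contains all six cyclic words over the four distinct letters, while `right` contains only four of them, so the two bracketings differ.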



Another operation $\otimes$ can be defined as follows.

Proposition 2. (Iteration Lemma) For any $M \in \mathrm{RAT}(\circ)$ there exists a number
$N \in \mathbb{N}$ such that for all $[x] \in M$ with $\|[x]\| > N$ there exist $[u], [v], [w] \in C_V$
such that the following holds:

$$[x] \in \{[u]\} \circ \{[v]\} \circ \{[w]\}, \quad \|\{[u]\} \circ \{[v]\}\| \le N, \quad \|[v]\| > 0,$$

$$\forall k \ge 0:\ \{[u]\} \circ \{[v]\}^k \circ \{[w]\} \subseteq M.$$



Definition 7. For any word language $L \subseteq V^*$ define the set of all equivalence
classes with respect to $\sim$ by $\kappa(L) = \{[x] \mid x \in L\}$. $\kappa$ is chosen for the Greek word
κύκλος meaning circle.

For any set $M \subseteq C_V$ let

$$\gamma(M) = \bigcup_{[x] \in M} [x] \subseteq V^*.$$

$\gamma$ is chosen for γραμμή meaning line in Greek.

Trivially, $\kappa\gamma(M) = M$. But in general, $\gamma\kappa(L) \neq L$.

$\gamma\kappa(L)$ represents the closure of $L$ under cyclic permutation.
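For finite languages the two maps of Definition 7 are easy to compute. The sketch below is an illustration under assumptions made here: a cyclic word is represented as a `frozenset` of its rotations, and the function names `kappa` and `gamma` merely mirror the symbols $\kappa$ and $\gamma$.

```python
def rotations(word):
    """All cyclic permutations of word, i.e. the class [word]."""
    if not word:
        return {""}
    return {word[i:] + word[:i] for i in range(len(word))}

def kappa(L):
    """kappa(L): the set of cyclic words [x] with x in the finite language L."""
    return {frozenset(rotations(x)) for x in L}

def gamma(M):
    """gamma(M): the union of all classes in the set M of cyclic words."""
    out = set()
    for cls in M:
        out |= cls
    return out

L = {"ab", "aab"}
closure = gamma(kappa(L))   # closure of L under cyclic permutation
```

With this `L`, `closure` contains `ba`, `aba`, and `baa` in addition to the two original words, which illustrates $\gamma\kappa(L) \neq L$, while applying `kappa` to `closure` returns `kappa(L)` again.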

Example 1. Let $V = \{a, b, c, d\}$. Then:

$$\{[a]\} \circ \{[b]\} = \{[ab]\},$$

$$\{[a]\} \circ \{[a]\} \circ \{[b]\} = \{[aa]\} \circ \{[b]\} = \{[aab]\},$$

$$\{[a]\} \circ \{[b]\} \circ \{[c]\} = \{[ab]\} \circ \{[c]\} = \{[a]\} \circ \{[bc]\} = \{[abc], [acb]\},$$

$$\{[a]\} \circ \{[b]\} \circ \{[c]\} \circ \{[d]\} = \{[a]\} \circ \{[bc]\} \circ \{[d]\} = \{[ab]\} \circ \{[cd]\}
= \{[abcd], [abdc], [adbc], [acbd], [acdb], [adcb]\},$$

$$\{[abc]\} \circ \{[d]\} = \{[abcd], [adbc], [abdc]\},$$

$$\{[ab]\} \circ \{[ab]\} = \{[aabb], [abab]\}.$$
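The values in Example 1 can be reproduced by brute force. The following Python sketch is illustrative only: cyclic words are represented by their least rotation, the shuffle of two words is enumerated recursively, and the names `canon`, `rotations`, `shuffle`, and `circ` are hypothetical. The operation is extended to sets by union.

```python
def canon(word):
    """Least rotation of word, used as representative of the cyclic word."""
    if not word:
        return ""
    return min(word[i:] + word[:i] for i in range(len(word)))

def rotations(word):
    """All cyclic permutations of word."""
    if not word:
        return {""}
    return {word[i:] + word[:i] for i in range(len(word))}

def shuffle(u, v):
    """All interleavings of the words u and v."""
    if not u:
        return {v}
    if not v:
        return {u}
    return ({u[0] + w for w in shuffle(u[1:], v)} |
            {v[0] + w for w in shuffle(u, v[1:])})

def circ(A, B):
    """Shuffle of cyclic words: classes of all shuffles of all rotations."""
    return {canon(t)
            for x in A for y in B
            for xi in rotations(x) for eta in rotations(y)
            for t in shuffle(xi, eta)}
```

For instance, `circ({"ab"}, {"cd"})` yields all six cyclic words over four distinct letters, and `circ({"ab"}, {"ab"})` yields the two classes of `aabb` and `abab`. The enumeration is exponential in the word lengths, so it is only meant for small examples.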

Definition 8. (Primitive Cyclic Words) A cyclic word $[x]$ is called primitive with
respect to $\circ$ iff it does not fulfill $\{[x]\} \subseteq \{[y]\}^k$ for some $y \in V^*$ and some $k > 1$,
where the power is meant with respect to $\circ$.

Note that if $x$ is primitive with respect to catenation $\cdot$, then all $\xi \in [x]$ are
primitive with respect to $\cdot$, too.



Definition 9. Let $A \subseteq C_V$. Then the $\oplus$-algebraic closure $\mathcal{L}_\oplus(A)$ of the set $A$,
with $\oplus \in \{\odot, \otimes, \circ\}$, is defined by

$$A_0 = A,$$

$$A_{j+1} = \bigcup_{[x],[y] \in A_j} \{[x]\} \oplus \{[y]\},$$

$$\mathcal{L}_\oplus(A) = \bigcup_{j=0}^{\infty} A_j.$$

Furthermore, if $X$ is any class of sets (languages), let $ACL_\oplus(X)$ denote the
class of all $\oplus$-algebraic closures of sets $A \in X$.

Definition 5. Consider

$$\{[x]\} \otimes \{[y]\} = \bigcup_{a \in V} \{[a\xi a\eta] \mid a\xi \in [x],\ a\eta \in [y]\}.$$

Note that $\{[x]\} \otimes \emptyset = \emptyset \otimes \{[x]\} = \emptyset \otimes \emptyset = \emptyset$ and
$\{[x]\} \otimes \{[\lambda]\} = \{[\lambda]\} \otimes \{[x]\} = \emptyset$.

This operation, too, is commutative but not associative.

Lemma 3. The operation $\otimes$ is commutative but not associative.

Proof. Commutativity is obvious from the definition. But

$$(\{[ac]\} \otimes \{[ab]\}) \otimes \{[bc]\} = \{[acab]\} \otimes \{[bc]\} = \{[bacabc], [cabacb]\},$$

$$\{[ac]\} \otimes (\{[ab]\} \otimes \{[bc]\}) = \{[ac]\} \otimes \{[babc]\} = \{[acabcb], [cacbab]\},$$

showing that $\otimes$ is not associative. □
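The computation in the proof of Lemma 3 can also be replayed in code. This Python sketch is an illustration under the same assumptions as before (least rotation as representative, extension to sets by union); the name `otimes` is hypothetical and stands for the operation of Definition 5, which joins two cyclic words at rotations that start with the same letter.

```python
def canon(word):
    """Least rotation of word, used as representative of the cyclic word."""
    if not word:
        return ""
    return min(word[i:] + word[:i] for i in range(len(word)))

def rotations(word):
    """All cyclic permutations of word."""
    if not word:
        return {""}
    return {word[i:] + word[:i] for i in range(len(word))}

def otimes(A, B):
    """Operation of Definition 5: [a xi a eta] for a xi in [x], a eta in [y]."""
    out = set()
    for x in A:
        for y in B:
            for xr in rotations(x):
                for yr in rotations(y):
                    # Both rotations must be non-empty and share a first letter.
                    if xr and yr and xr[0] == yr[0]:
                        out.add(canon(xr + yr))
    return out

left = otimes(otimes({canon("ac")}, {canon("ab")}), {canon("bc")})
right = otimes({canon("ac")}, otimes({canon("ab")}, {canon("bc")}))
```

Under this representation the two results are different sets. They do share one class, since `bacabc` and `acabcb` are rotations of each other, but the second elements `cabacb` and `cacbab` are not.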

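Thus, we define another operation $\circ$, using the shuffle operation $\sqcup\!\sqcup$, as follows.

Definition 6.

$$\{[x]\} \circ \{[y]\} = \{[\tau] \mid \tau \in \{\xi\} \sqcup\!\sqcup \{\eta\},\ \xi \in [x],\ \eta \in [y]\} = \{[\tau] \mid \tau \in [x] \sqcup\!\sqcup [y]\}.$$

This operation may be called the shuffle of cyclic words.

Lemma 4. The operation $\circ$ is commutative and associative.

Proof. Commutativity is obvious since $\sqcup\!\sqcup$ is a commutative operation.

$$\begin{aligned}
[\sigma] \in (\{[x]\} \circ \{[y]\}) \circ \{[z]\}
&\iff \exists \tau \in [x] \sqcup\!\sqcup [y]:\ [\sigma] \in \{[\tau]\} \circ \{[z]\} \\
&\iff \exists \tau \in [x] \sqcup\!\sqcup [y]:\ \sigma \in [\tau] \sqcup\!\sqcup [z] \\
&\iff \sigma \in ([x] \sqcup\!\sqcup [y]) \sqcup\!\sqcup [z] = [x] \sqcup\!\sqcup ([y] \sqcup\!\sqcup [z]) \\
&\iff \exists \rho \in [y] \sqcup\!\sqcup [z]:\ \sigma \in [x] \sqcup\!\sqcup [\rho] \\
&\iff \exists \rho \in [y] \sqcup\!\sqcup [z]:\ [\sigma] \in \{[x]\} \circ \{[\rho]\} \\
&\iff [\sigma] \in \{[x]\} \circ (\{[y]\} \circ \{[z]\}).
\end{aligned}$$

Therefore, $\circ$ is associative. □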

Consequently, the structure $M_c = (2^{C_V}, \circ, \{[\lambda]\})$ is a monoid, and the
structure $S_c = (2^{C_V}, \cup, \circ, \emptyset, \{[\lambda]\})$ is an $\omega$-complete semiring.

Thus, systems of equations may be defined [10]. Since $\circ$ is commutative, the
classes of rational, linear and algebraic sets coincide. The class of such sets will
be denoted by $\mathrm{RAT}(\circ)$.

Proposition 1. Any $M \in \mathrm{RAT}(\circ)$ can be generated by a right-linear
$\circ$-grammar $G = (\Delta, \Sigma, S, P)$ where $\Delta$ is a finite set of variables, $\Sigma \subseteq C_V$ a
finite set of constants, $S \in \Delta$, and $P$ a finite set of productions of the forms
$X \to Y \circ \{[x]\}$, $X \to \{[x]\}$ with $[x] \in \Sigma$, and $[x] = [\lambda]$ only for $S \to \{[\lambda]\}$ and $S$
does not appear on any other right hand side of a production.

Lemma 5. $\{[x]\} \circ \{[y]\} = \kappa(\gamma([x]) \sqcup\!\sqcup \gamma([y]))$, $A \circ B = \kappa(\gamma(A) \sqcup\!\sqcup \gamma(B))$, and
$\{[x_1]\} \circ \cdots \circ \{[x_k]\} = \kappa(\gamma([x_1]) \sqcup\!\sqcup \cdots \sqcup\!\sqcup \gamma([x_k]))$ for $k \ge 1$.

Proof. The first statement is just the definition of $\circ$.

$$A \circ B = \bigcup_{[x] \in A, [y] \in B} \{[x]\} \circ \{[y]\} = \bigcup_{[x] \in A, [y] \in B} \kappa(\gamma([x]) \sqcup\!\sqcup \gamma([y]))
= \kappa\Big(\bigcup_{[x] \in A, [y] \in B} \gamma([x]) \sqcup\!\sqcup \gamma([y])\Big) = \kappa(\gamma(A) \sqcup\!\sqcup \gamma(B)).$$

The last statement follows by induction on $k$, using the second statement. □

3 Results 

The first theorem presents a proof of Exercises 3.4 c in [6] or 4.2.11 in [5],
respectively. (We give a proof for technical reasons.)

Theorem 1. REG is closed under cyclic permutation: $\gamma\kappa(\mathrm{REG}) \subset \mathrm{REG}$.

Proof. Let $G = (V_N, V_T, S, P)$ be a type-3 grammar generating the language
$L = L(G)$. Assume that $G$ is in normal form, i.e. the productions are of the form
$A \to aB$ or $A \to a$, and $S$ does not occur on the right-hand side of any production.
Assume also that each $x \in V_N \cup V_T$ is reachable from $S$.

From G construct a new type-3 grammar Gi = {V\n, V\t, S\,Pi) with 

ViN = Vat X Vat U Vat X Eat U {S'!}, 

ViT = Vt, and productions 
Pi = {Si^(B,A} I BA^aB € Pj 

U {(B, A)^a(G, A) \ B^aG G P} U { {B, A)^a{S, A) \ B^a G P} 

U {{B,A)^a{C,A) I B^aGeP}U {(A,A)^a | A^aGP}. 

Then a derivation 

S'=^aoAi=^ • • • =^oo • • • Ofc-i Afc=^ao • • • auAu+i^ • • • 

=^ao ' ' ' * * • a.jjiAjn+i^o,Q • • • a^ak+i ■ ■ * Om0.m+i 

implies a derivation 

, Af^')=^ak-\-i (A/,;_|_2 , A]f^=^a]^j^\ • • • Ora , A}f^ 

* * * amUm-t-i (*5^5 A]fj=^ak+i * * * Um+iao(Ai, A}f) 

^ ' * * O.k+1 * * * O.m+1^0 * * * Ofc— 1 (Afc, Af^')=^ak-\-l ' ' ' Um-t-iao * * * O-fc, 
and therefore 

*S'i=^fc+i • • • O-m+l^O * * * (^m- 
On the other hand, a derivation 

(^fc+2 j (^m+1 5 -^k) 

■) * * * ^m+lQ'o(^l? -^k) 

^ ' ' * ^fc+1 * * * 0-m+ino * * * 0,k—l ^k)^^k-\-l ' ' * ^m+lO-0 * * * ^k 

implies derivations 

' * * ^mO-m+1 and 

S=^ct()jA.i=^ • • • (7-0 ’ ’ ’ ’ ’ ‘ , 

and therefore 

S^CIq • • • • • • Q,rn-\-l- 

Thus, $L(G_1) = \gamma\kappa(L)$, the set of all cyclic permutations of $L$.

The proper inclusion follows from the fact that there are regular languages
not being closed under cyclic permutation, like $L = \{a^m b^n \mid m, n \ge 0\}$. □

Theorem 2. LIN is not closed under cyclic permutation, i.e., $\gamma\kappa(\mathrm{LIN}) \not\subseteq \mathrm{LIN}$.

Proof. Consider the language $L = \{a^n b^n \mid n \ge 0\} \in \mathrm{LIN}$. Assume $\gamma\kappa(L) \in \mathrm{LIN}$.
Thus, $a^k b^{2k} a^k \in \gamma\kappa(L)$. By the iteration lemma for linear context-free
languages there exists $N = N(L) \in \mathbb{N}$ such that for $k > N$ there are words
$u, v, w, x, y$ such that $z = a^k b^{2k} a^k = uvwxy$ with $|uv| \le N$, $|xy| \le N$, $|vx| > 0$,
and $\forall j \ge 0 : uv^j w x^j y \in \gamma\kappa(L)$. But then $uv \in \{a\}^*$, $xy \in \{a\}^*$, and only $a$'s can be
iterated, contradicting the fact that $|z|_a = |z|_b$ for all $z \in \gamma\kappa(L)$. □

The next Theorem is from [6] . 

Theorem 3. CF is closed under cyclic permutation: $\gamma\kappa(\mathrm{CF}) \subset \mathrm{CF}$.

Proof. $\gamma\kappa(\mathrm{CF}) \subseteq \mathrm{CF}$ is Exercise 6.4 c and its solution in [6], pp. 142-144. The
problem is also stated as Exercise 7.3.1 c in [5].

The proper inclusion follows again from the fact that there are context-free
(even regular) languages not being closed under cyclic permutation, like the
example in the proof of Theorem 1. □

Theorem 4. $\kappa\mathrm{REG} \not\subseteq \mathrm{RAT}(\circ)$.

Proof. Consider $L = \{(ab)^n \mid n \ge 0\} \in \mathrm{REG}$. From this follows that $\kappa(L) =
\{[(ab)^n] \mid n \ge 0\}$ with $[(ab)^n] = \{(ab)^n, (ba)^n\}$.

Now assume $\kappa(L) \in \mathrm{RAT}(\circ)$. Then, by the iteration lemma for $\mathrm{RAT}(\circ)$
there exists an $N \in \mathbb{N}$ such that for $k > N$ there exist $u, v, w$ such that $[(ab)^k] \in
\{[u]\} \circ \{[v]\} \circ \{[w]\}$ with $\|\{[u]\} \circ \{[v]\}\| \le N$, $\|[v]\| > 0$, and with $\forall j \ge 0 :
\{[u]\} \circ \{[v]\}^j \circ \{[w]\} \subseteq \kappa(L)$.

Now $[uvw] \in \kappa(L)$ and $[uvvw] \in \kappa(L)$. Because of $\|[v]\| > 0$ it follows that
$v = av'$ or $v = bv'$. From $[v'a] = [av']$ and $[v'b] = [bv']$ follows that $[uv'aav'w] \in
\kappa(L)$ or $[uv'bbv'w] \in \kappa(L)$, a contradiction to the structure of $\kappa(L)$. □

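Theorem 5. $\gamma\mathrm{RAT}(\circ) \not\subseteq \mathrm{CF}$.

Proof. Consider the language $M$ of cyclic words defined by $M = \{[a]\} \circ M \circ \{[bc]\}$.
Clearly,

$$M = \bigcup_{k \ge 0} \{[a]\}^k \circ \{[bc]\}^k.$$

Now, $\{[a]\}^k = \{[a^k]\}$ and $[b^k c^k] \in \{[bc]\}^k$ for $k \ge 0$ which can be shown
by induction on $k$, namely $[bc] \in \{[bc]\}$, and assuming $[b^k c^k] \in \{[bc]\}^k$ follows
$[b^{k+1} c^{k+1}] = [b^k c^k c b] \in \{[b^k c^k]\} \circ \{[bc]\} \subseteq \{[bc]\}^{k+1}$.

From this follows that $[a^k b^k c^k] \in \{[a^k]\} \circ \{[b^k c^k]\} = \{[a]\}^k \circ \{[b^k c^k]\} \subseteq
\{[a]\}^k \circ \{[bc]\}^k \subseteq M$.

Therefore, $\gamma(M) \cap \{a\}^*\{b\}^*\{c\}^* = \{a^k b^k c^k \mid k \ge 0\} \notin \mathrm{CF}$.

Thus, $\gamma(M) \notin \mathrm{CF}$. □

Theorem 6. $\mathrm{RAT}(\circ) \not\subseteq \kappa\mathrm{CF}$.

Proof. Assume the contrary, $\mathrm{RAT}(\circ) \subseteq \kappa\mathrm{CF}$. From this follows $\gamma\mathrm{RAT}(\circ) \subseteq
\gamma\kappa\mathrm{CF} \subseteq \mathrm{CF}$, a contradiction to Theorem 5. □

Theorem 7. $\gamma\kappa\mathrm{REG} \not\subseteq \gamma\mathrm{RAT}(\circ)$.

Proof. Assume the contrary, $\gamma\kappa\mathrm{REG} \subseteq \gamma\mathrm{RAT}(\circ)$. From this follows $\kappa\mathrm{REG} =
\kappa\gamma\kappa\mathrm{REG} \subseteq \kappa\gamma\mathrm{RAT}(\circ) = \mathrm{RAT}(\circ)$, a contradiction to Theorem 4. □

Theorem 8. $\mathrm{REG} \not\subseteq \gamma\kappa\mathrm{CF}$.

Proof. Consider the language $L = \{a\}\{b\}^* \in \mathrm{REG}$. $L \notin \gamma\kappa\mathrm{CF}$ is obvious since
$L$ is not closed under cyclic permutation. □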



Theorem 9. $\gamma\kappa\mathrm{LIN} \subset \gamma\kappa\mathrm{CF}$.

Proof. Consider the language $L = \{a^k b^k c^m d^m e^n f^n \mid k, m, n \ge 0\} \in \mathrm{CF}$ and
assume the contrary, $\gamma\kappa(L) \in \gamma\kappa\mathrm{LIN}$. Then there exists a language $L' \in \mathrm{LIN}$
such that $\gamma\kappa(L') = \gamma\kappa(L)$, and $L'$ contains at least one cyclic permutation of
$a^k b^k c^m d^m e^n f^n$ for each triple $(k, m, n) \in (\mathbb{N} \setminus \{0\})^3$. Thus $|L'| = \infty$.

Define the languages

$$L_1 = L' \cap \{a\}^*\{b\}^*\{c\}^*\{d\}^*\{e\}^*\{f\}^*\{a\}^* \;\cup\; L' \cap \{b\}^*\{c\}^*\{d\}^*\{e\}^*\{f\}^*\{a\}^*\{b\}^*,$$

$$L_2 = L' \cap \{c\}^*\{d\}^*\{e\}^*\{f\}^*\{a\}^*\{b\}^*\{c\}^* \;\cup\; L' \cap \{d\}^*\{e\}^*\{f\}^*\{a\}^*\{b\}^*\{c\}^*\{d\}^*,$$

$$L_3 = L' \cap \{e\}^*\{f\}^*\{a\}^*\{b\}^*\{c\}^*\{d\}^*\{e\}^* \;\cup\; L' \cap \{f\}^*\{a\}^*\{b\}^*\{c\}^*\{d\}^*\{e\}^*\{f\}^*.$$

Because LIN is closed under intersection with regular languages, $L_1, L_2, L_3 \in
\mathrm{LIN}$, and it is obvious that $L' = L_1 \cup L_2 \cup L_3$.

Now define the homomorphisms $h_1, h_2, h_3$ by

$$h_1(a) = h_1(b) = \lambda,\ h_1(c) = c,\ h_1(d) = d,\ h_1(e) = e,\ h_1(f) = f,$$

$$h_2(a) = a,\ h_2(b) = b,\ h_2(c) = h_2(d) = \lambda,\ h_2(e) = e,\ h_2(f) = f,$$

$$h_3(a) = a,\ h_3(b) = b,\ h_3(c) = c,\ h_3(d) = d,\ h_3(e) = h_3(f) = \lambda.$$

Because LIN is closed under morphisms, $h_1(L_1), h_2(L_2), h_3(L_3) \in \mathrm{LIN}$.
Furthermore, $h_1(L_1)$ contains only words of the form $c^m d^m e^n f^n$, $h_2(L_2)$ only such
of the form $e^n f^n a^k b^k$, and $h_3(L_3)$ only such of the form $a^k b^k c^m d^m$.

At least one of the languages $h_j(L_j)$, $j = 1, 2, 3$, has the property that both
indices ($(m, n)$, $(k, n)$, or $(k, m)$, respectively) are unbounded.

Assume the contrary, namely that one index is bounded. This implies for the
languages $L_j$, $j = 1, 2, 3$, that at most two indices are unbounded ($(k, m)$, $(k, n)$,
$(k, m)$, $(m, n)$, $(k, n)$, or $(m, n)$, respectively). But from this follows that there is
no triple $(k, m, n)$ with all three indices unbounded, contradicting the property
of $L'$.

Without loss of generality assume that $h_1(L_1)$ has the above property, implying
that $h_1(L_1)$ contains words $c^m d^m e^n f^n$ with both indices $(m, n)$ unbounded.

Now for $h_1(L_1)$ the iteration lemma for linear languages holds, with a
constant $N > 0$. Thus there exist $m, n > N$ such that $c^m d^m e^n f^n \in h_1(L_1)$. But the
iteration lemma implies g hi{Li) for j > 0 , a contradiction to 

the structure of hi{Li). 

The two other cases are shown in the same way. Therefore, $\gamma\kappa(L) \notin \gamma\kappa\mathrm{LIN}$. □

Note that $\gamma\kappa(L) \in \gamma\kappa\mathrm{LIN}$ for $L = \{a^i b^i c^j d^j \mid i, j \ge 0\} \notin \mathrm{LIN}$. This is seen
by considering $L' = \{b^i c^j d^j a^i \mid i, j \ge 0\}$.

Theorem 10. $\kappa\mathrm{LIN} \subset \kappa\mathrm{CF}$.

Proof. Assume $\kappa\mathrm{LIN} = \kappa\mathrm{CF}$ which implies $\gamma\kappa\mathrm{LIN} = \gamma\kappa\mathrm{CF}$, a contradiction
to Theorem 9. □

The next two theorems state the closure of context-sensitive and recursively
enumerable languages under cyclic permutation, also cited in [5].

Theorem 11. $\gamma\kappa\mathrm{CS} \subset \mathrm{CS}$.

Proof. This is obvious by adding productions to a length-increasing grammar $G$
with $L(G) = L$, achieving all cyclic permutations of words $w \in L$. The proper
inclusion follows from the fact that there exist context-sensitive languages not
being closed under cyclic permutation. □

Theorem 12. $\gamma\kappa\mathrm{RE} \subset \mathrm{RE}$.

Proof. This is shown in a way analogous to Theorem 11. □

For the following note that $\sqcup\!\sqcup$ is an associative and commutative operation
on words. For it, propositions analogous to Propositions 1 and 2 hold.

Lemma 6. $\gamma(\kappa(\{x\} \sqcup\!\sqcup \{y\})) = \gamma(\kappa(\{x\})) \sqcup\!\sqcup \gamma(\kappa(\{y\}))$ for any $x, y \in V^*$.

Proof. Any w G {x} lu {y} has the form w = u\Vi • • • with x = u\ • • - Un 
and y = vi ■ • • Vn where it is also possible that Ui = X and vj = A for some i,j. 
Then z G 7(K({a;} lu {y})) has the form either z = uaVi ■ ■ ■ UnVnUiVi ■ ■ • un or 

Z = Vj2Uj+i • • • UnVnUiVi • • ■ UjVji with Ui = Ui\Ui2 and Vj = VjiVj2. 

Then either z = Ui2Vi2 • • • UnVnUiVi • • • unVn with vi2 = Vi, vn = A, and 
Ui2 • • • UnUi G [x], Vi 2 --- VnVi ■ ■ ■ Vn G [y] , 

or z = Uj2Vj2Uj+i---UnVnUiVi---UjiVji with Uj2 = A, Uji = Uj, and 

Uj 2 ■ ■ ■ UnUi ■ ■ ■ Uji G [x], Vj 2 ' ' ' V„Vi ■ ■ ■ Vji G [y] . 

Therefore z G 7(k({ 2;})) Luydy})). 

On the other hand, if z G 7 (k({2;})) lu 7 ({y})), then x = ui ■ • • unua • • • Un, 
y = vi - ■■ ViiVi2 ■■■Vn, and z = Ui2Vi2 ■ ■ ■ • • • UiiVu, where possibly Ui = 

A and Vj = A for some i,j. 

But w = u\Vi ■ • • UiiViiUi2Vi2 ■ ■ ■ UnVn G {x} LU {y}, and therefore it follows 
that z = Ui2Vi2 ■ ■ ■ UnVnUiVi ■ ■ ■ unVn G j{k{{x} LU {y})). 

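Hence $\gamma(\kappa(\{x\} \sqcup\!\sqcup \{y\})) = \gamma(\kappa(\{x\})) \sqcup\!\sqcup \gamma(\kappa(\{y\}))$. □

Theorem 13. $\gamma(\kappa(A \sqcup\!\sqcup B)) = \gamma(\kappa(A)) \sqcup\!\sqcup \gamma(\kappa(B))$ for any $A, B \subseteq V^*$.

Proof.

$$\gamma(\kappa(A \sqcup\!\sqcup B)) = \gamma\Big(\kappa\Big(\bigcup_{x \in A, y \in B} \{x\} \sqcup\!\sqcup \{y\}\Big)\Big) = \bigcup_{x \in A, y \in B} \gamma(\kappa(\{x\} \sqcup\!\sqcup \{y\}))$$

$$= \bigcup_{x \in A, y \in B} \big(\gamma(\kappa(\{x\})) \sqcup\!\sqcup \gamma(\kappa(\{y\}))\big) = \gamma(\kappa(A)) \sqcup\!\sqcup \gamma(\kappa(B)).$$

Lemma 7. $\gamma(\kappa(\{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_k\})) = \gamma(\kappa(\{x_1\})) \sqcup\!\sqcup \cdots \sqcup\!\sqcup \gamma(\kappa(\{x_k\}))$ for $x_i \in V^*$
and $k \in \mathbb{N}$.

Proof. This is shown by induction on $k$. This is obvious for $k = 1$. For $k = 2$
this is Lemma 6. Now, using Theorem 13,

$$\gamma(\kappa(\{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_k\} \sqcup\!\sqcup \{x_{k+1}\}))$$

$$= \gamma(\kappa(\{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_k\})) \sqcup\!\sqcup \gamma(\kappa(\{x_{k+1}\}))$$

$$= \gamma(\kappa(\{x_1\})) \sqcup\!\sqcup \cdots \sqcup\!\sqcup \gamma(\kappa(\{x_k\})) \sqcup\!\sqcup \gamma(\kappa(\{x_{k+1}\})). \ □$$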

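Lemma 8. $\gamma\kappa\mathrm{RAT}(\sqcup\!\sqcup) \subset \mathrm{RAT}(\sqcup\!\sqcup)$.

Proof. Consider $L \in \mathrm{RAT}(\sqcup\!\sqcup)$. Then $L$ can be generated by a right-linear
$\sqcup\!\sqcup$-grammar $G = (\Delta, \Sigma, S, P)$ with productions of the form $X \to Y \sqcup\!\sqcup \{x\}$ or $X \to \{x\}$
where $x \in \Sigma$, $S \in \Delta$, and all sets $\Delta$, $\Sigma \subseteq V^*$, and $P$ are finite.

Therefore, for all $x \in L$ there exist $n \in \mathbb{N}$ and $x_i \in \Sigma$ such that
$x \in \{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_n\}$.

Now construct a new right-linear $\sqcup\!\sqcup$-grammar $G' = (\Delta, \Sigma', S, P')$ with
$\Sigma' = \bigcup_{x \in \Sigma} [x]$ and productions

$$P' = \{X \to Y \sqcup\!\sqcup \{y\} \mid X \to Y \sqcup\!\sqcup \{x\} \in P,\ y \in [x]\}
\cup \{X \to \{y\} \mid X \to \{x\} \in P,\ y \in [x]\}.$$

Now $x \in L$ implies $S \Rightarrow^* \{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_n\}$ in $G$ for some $n \in \mathbb{N}$ and $x_i \in \Sigma$.
By the construction of $P'$ follows that $S \Rightarrow^* \{y_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{y_n\}$ in $G'$ for all $y_i \in [x_i]$.
By Lemma 7, this implies
$\gamma(\kappa(\{x_1\})) \sqcup\!\sqcup \cdots \sqcup\!\sqcup \gamma(\kappa(\{x_n\})) = \gamma(\kappa(\{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_n\})) \subseteq L(G')$. Therefore
$\gamma(\kappa(\{x\})) \subseteq L(G')$ for all $x \in L$, hence $\gamma(\kappa(L)) \subseteq L(G')$.

On the other hand, $z \in L(G')$ implies $z \in \{y_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{y_n\}$ for some $n \in \mathbb{N}$
and $y_i \in [x_i]$, $x_i \in \Sigma$. From this follows that
$z \in \gamma(\kappa(\{x_1\})) \sqcup\!\sqcup \cdots \sqcup\!\sqcup \gamma(\kappa(\{x_n\})) = \gamma(\kappa(\{x_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{x_n\}))$, hence $z \in \gamma(\kappa(L))$,
implying $L(G') \subseteq \gamma(\kappa(L))$.

Therefore $L(G') = \gamma(\kappa(L))$.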

Consider L = G RAT(lu). Trivially, L ^ 7kRAT(lu) since L is not 

closed under cyclic permutation. E.g., ab, abab G L, but ba, baba ^ L. □ 

Theorem 14. $\mathrm{RAT}(\sqcup\!\sqcup) \subset \mathrm{CS}$, $\kappa\mathrm{RAT}(\sqcup\!\sqcup) \subset \kappa\mathrm{CS}$.

Proof. $\mathrm{RAT}(\sqcup\!\sqcup) \subseteq \mathrm{CS}$ has been shown in [7,8], from which follows
$\kappa\mathrm{RAT}(\sqcup\!\sqcup) \subseteq \kappa\mathrm{CS}$.

The proper inclusions follow from the fact that all languages in $\mathrm{RAT}(\sqcup\!\sqcup)$ are
semilinear, but CS contains non-semilinear languages. □



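[Figure 1: diagram of the relations between the word language classes RE, CS, $\gamma\kappa$RE, CF, RAT($\sqcup\!\sqcup$), $\gamma\kappa$CS, LIN, $\gamma\kappa$CF, $\gamma\kappa$RAT($\sqcup\!\sqcup$), REG, $\gamma\kappa$LIN, $\gamma$RAT($\circ$), and $\gamma\kappa$REG.]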



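[Figure 2: diagram of the relations between the classes of cyclic word languages $\kappa$RE, $\kappa$CS, $\kappa$RAT($\sqcup\!\sqcup$), $\kappa$CF, RAT($\circ$), $\kappa$LIN, and $\kappa$REG.]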



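Theorem 15. $\mathrm{RAT}(\circ) \subseteq \kappa\mathrm{RAT}(\sqcup\!\sqcup)$.

Proof. Consider $M \in \mathrm{RAT}(\circ)$. Then there exists a right-linear $\circ$-grammar $G =
(\Delta, \Sigma, S, P)$, generating $M = L(G)$, with $\Sigma \subseteq V^*$ and productions of the form
$X \to Y \circ \{[x]\}$ or $X \to \{[x]\}$, $x \in \Sigma$. Let $[z] \in M$. Then $[z] \in \{[x_1]\} \circ \cdots \circ \{[x_n]\}$ for
some $n \in \mathbb{N}$, $x_i \in \Sigma$, and $S \Rightarrow^* \{[x_1]\} \circ \cdots \circ \{[x_n]\}$ in $G$.

Now $\{[x_1]\} \circ \cdots \circ \{[x_n]\} = \{[\tau] \mid \tau \in \{\xi_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{\xi_n\},\ \xi_i \in [x_i]\}$.

Construct a right-linear $\sqcup\!\sqcup$-grammar $G' = (\Delta, \Sigma', S, P')$ with
$\Sigma' = \bigcup_{x \in \Sigma} [x]$ and productions

$$P' = \{X \to Y \sqcup\!\sqcup \{\xi\} \mid \xi \in [x],\ X \to Y \circ \{[x]\} \in P\}
\cup \{X \to \{\xi\} \mid \xi \in [x],\ X \to \{[x]\} \in P\}.$$

Then $S \Rightarrow^* \{\xi_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{\xi_n\}$ in $G'$ iff $\xi_i \in [x_i]$ and $S \Rightarrow^* \{[x_1]\} \circ \cdots \circ \{[x_n]\}$ in $G$.

Since
$\kappa(\{\xi_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{\xi_n\}) = \{[\tau] \mid \tau \in \{\xi_1\} \sqcup\!\sqcup \cdots \sqcup\!\sqcup \{\xi_n\}\} \subseteq \{[x_1]\} \circ \cdots \circ \{[x_n]\}$ it
follows that $\kappa(L(G')) \subseteq M$.

On the other hand, this is true for all $\xi_i \in [x_i]$. Therefore it follows that
$\{[x_1]\} \circ \cdots \circ \{[x_n]\} \subseteq \kappa(L(G'))$, hence $M \subseteq \kappa(L(G'))$.

Thus $M = \kappa(L(G'))$. □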







Manfred Kudlek 



Theorem 16. $\gamma\mathrm{RAT}(\circ) \subseteq \gamma\kappa\mathrm{RAT}(\sqcup\!\sqcup)$.

Proof. This follows immediately from Theorem 15. □

The relations between the language classes considered so far are shown in 
the diagrams of Figures 1 and 2. 

4 Conclusion 

An open problem is whether the inclusions $\mathrm{RAT}(\circ) \subseteq \kappa\mathrm{RAT}(\sqcup\!\sqcup)$ and
$\gamma\mathrm{RAT}(\circ) \subseteq \gamma\kappa\mathrm{RAT}(\sqcup\!\sqcup)$ are proper or not. Other problems to be solved are
the relation of language classes defined by grammars on cyclic words to those
language classes considered above, and closure properties under certain language
operations. Also the relations of language classes defined as algebraic closure
under the operations $\odot$ and $\otimes$ have to be investigated. These are related to simple
circular splicing operations [3]. Another open problem is to find alternative
associative operations on cyclic words to generate classes of languages of cyclic
words. Of interest is also the investigation of the language of primitive cyclic
words and its relation to the set of primitive words.

Abstract. This paper describes the design of a linear time DNA algorithm
for the Hamiltonian Path Problem (HPP) suited for parallel implementation
using a microfluidic system. This bioalgorithm was inspired by
the algorithm contained in [16] within the tissue P systems model. The
algorithm is not based on the usual brute force generate/test technique,
but builds the solution space gradually. The possible solutions/paths are
built step by step by exploring the graph according to a breadth-first 
search so that only the paths that represent permutations of the set of 
vertices, and which, therefore, do not have repeated vertices (a vertex is 
only added to a path if this vertex is not already present) are extended. 
This simple distributed DNA algorithm has only two operations: concate- 
nation (append) and sequence separation (filter). The HPP is resolved 
autonomously by the system, without the need for external control or 
manipulation. In this paper, we also note other possible bioalgorithms 
and the relationship of the implicit model used to solve the HPP to 
other abstract theoretical distributed DNA computing models (test tube 
distributed systems, grammar systems, parallel automata). 



1 Introduction 

Microfluidic device, microflow reactor and “lab-on-a-chip” (LOC)
are different names for micro devices composed basically of microchannels and
microchambers. These are passive fluidic elements, formed in the planar layer
of a chip substrate, which serve only to confine liquids to small cavities.
Interconnection of channels allows the formation of networks along which liquids
can be transported from one location of the device to another, where controlled
biochemical reactions take place in a shorter time and more accurately than in
conventional laboratories. There are also 3D microfluidic systems with a limited
number of layers. See [20,21] for an in-depth study of microflow devices.

The process of miniaturization and automation in analytical chemistry was
an issue first addressed in the early 1980s, but the design (in 1992) of the
device called “Capillary electrophoresis on a chip” [15] was the first important
miniaturized chemical analysis system. Though the technology is still in its infancy,
interest in it has grown explosively over the last decade. The MIT Technology
Review has rated microreactors as one of the ten most promising technologies.

Microelectromechanical systems (MEMS) technologies are now widely applied to
microflow devices for fabrication of microchannels in different substrate materials,
the integration of electrical function into microfluidic devices, and the devel- 
opment of valves, pumps, and sensors. A thorough review of a large number 
of different microfluidic devices for nucleic acid analysis (for example, chemical 
amplification, hybridization, separation, and detection) is presented in [20]. 

These microfluidic devices can implement a dataflow-like architecture for 
the processing of DNA (see [11] and [17]) and could be a good support for 
the distributed biomolecular computing model called tissue P systems [16]. The 
underlying computational structure of tissue P systems are graphs or networks 
of connected processors that could be easily translated to microchambers (cells 
or processors) connected with microchannels. 

There are several previous works on DNA computing with microfluidic sys- 
tems. In one of them, Gehani and Reif [11] study the potential of microflow 
biomolecular computation, describe methods to efficiently route strands between 
chambers, and determine theoretical lower bounds on the quantities of DNA and 
the time needed to solve a problem in the microflow biomolecular model. Two
other works [5,17] solve the Maximum Clique Problem with microfluidic devices.
This is an NP-complete problem. McCaskill [17] takes a brute-force approach 
codifying every possible subgraph in a DNA strand. The algorithm uses the 
so-called Selection Transfer Modules (STM) to retain all possible cliques of the 
graph. The second step of McCaskill’s algorithm is a sorting procedure to deter- 
mine the maximum clique. By contrast, Whitesides group [5] describes a novel 
approach that uses neither DNA strands nor selection procedures. Subgraphs 
and edges of the graph are hard codified with wells and reservoirs, respectively, 
connected by channels and containing fluorescent beads. The readout is a mea- 
sure of the fluorescence intensities associated with each subgraph. The weakness 
of this approach is an exponential increase in the hardware needed with the 
number of vertices of the graph. 



2 A DNA Algorithm for the HPP 

2.1 A Tissue P System for the HPP 

We briefly and informally review the bioinspired computational model called 
tissue P systems (tP systems, for short). A detailed description can be found in 
[16]. 

A tissue P system is a network of finite-automata-like processors, dealing with
multisets of symbols, according to local states (available in a finite number for
each “cell”), communicating through these symbols, along channels (“axons”)
specified in advance. Each cell has a state from a given finite set and can process
multisets of objects, represented by symbols from a given alphabet. The standard
evolution rules are of the form $sM \to s'M'$, where $s, s'$ are states and $M, M'$
are multisets of symbols. We can apply such a rule only to one occurrence of
$M$ (that is, in a sequential, minimal way), or to all possible occurrences of $M$
(a parallel way), or, moreover, we can apply a maximal package of rules of the
form $sM_i \to s'M'_i$, $1 \le i \le k$, that is, involving the same states $s, s'$, which can
be applied to the current multiset (the maximal mode). Some of the elements of
$M'$ may be marked with the indication “go”, and this means that they have to
immediately leave the cell and pass to the cells to which there are direct links
through synapses. This communication (transfer of symbol-objects) can be done
in a replicative manner (the same symbol is sent to all adjacent cells), or in a
non-replicative manner; in the second case, we can send all the symbols to only
one adjacent cell, or we can distribute them non-deterministically.

One way to use such a computing device is to start from a given initial con- 
figuration and to let the system proceed until it reaches a halting configuration 
and to associate a result with this configuration. In these generative tP systems 
the output will be defined by sending symbols out of the system. To this end, 
one cell will be designated as the output cell. 

Within this model, the authors of [16] present a tP system (working in the maximal
mode and using the replicative communication mode) to solve the Hamiltonian
Path Problem (HPP). This problem involves determining whether or not, in
a given directed graph $G = (V, A)$ (where $V = \{v_1, \ldots, v_n\}$ is the set of vertices,
and $A \subseteq V \times V$ is the set of arcs), there is a path starting at some vertex $v_{in}$,
ending at some vertex $v_{out}$, and visiting all vertices exactly once. It is known
that the HPP is an NP-complete problem.

This tP system has one cell, $\sigma_i$, associated with each vertex, $v_i$, of the graph,
and the cells communicate along the edges of the graph. The output cell
is $\sigma_n$. The system works as follows (we refer to [16] for a detailed explanation).
In all cells in parallel, the paths $z \in V^*$ in $G$ grow simultaneously with the
following restriction: each cell $\sigma_i$ appends the vertex $v_i$ to the path $z$ if and only
if $v_i \notin z$. The cell $\sigma_n$ can work only after $n - 1$ steps and a symbol is sent out
of the net at step $n$. Thus, it is enough to watch the tP system at step $n$ and
if any symbol is sent out, then HPP has a solution; otherwise, we know that no
such solution exists. (Note that the symbol $z$ sent out describes a Hamiltonian
path in $G$.)



2.2 The DNA Algorithm 

The tP system for the HPP described above can be easily translated to a parallel
and distributed DNA algorithm to be implemented in a microfluidic system.
Recall that we have a directed graph $G = (V, A)$ with $n$ vertices. Our algorithm
checks for the presence of Hamiltonian paths in the graph in linear time; moreover,
it not only checks for a unique pair of special vertices $v_{in}$ and $v_{out}$, but for
all possible choices of the pair $(v_{in}, v_{out})$ from $V \times V$. Firstly, we describe the
overall operations of the algorithm:

— Coding. Each vertex $v_i$ has a short single strand associated with it. Each of
these single strands must be carefully selected to avoid mismatch hybridizations
during the run of the algorithm (see [3] for DNA words design criteria).

— In parallel, in $(n - 1)$ steps and starting from each vertex of the graph,
the algorithm grows paths with non-repeated vertices (searching the graph
breadth-first and starting from each vertex $v_i$).

— Time $t = n - 1$. Read the outputs of the system (empty or paths with $n$
non-repeated vertices). If there are no output strands, then the graph has no
Hamiltonian paths; otherwise, there are Hamiltonian paths (for some pair of
vertices $v_{in}$ and $v_{out}$).

The hardware of our system for a graph $G = (V, A)$ with $n$ vertices is composed
of one planar layer with $2n + 1$ chambers: $n$ filtering chambers $F_i$, $n$
append/concatenation chambers $C_i$, and one DNA sequencing microchip for
reading the Hamiltonian paths.

Chamber $F_i$. Inputs: DNA strands $z \in V^*$ from chambers $C_j$ such that
$(j, i) \in A$. Function: Retains the strands that contain the substrand associated
with vertex $v_i$. Output: Strands that do not contain the substrand associated
with $v_i$.

Chamber $C_i$. Inputs: DNA strands $z \in V^*$ from the output of $F_i$. Function:
Appends to every strand $z$ in its input the substrand associated with vertex $v_i$.
Output: $zv_i$.

Pattern of connectivity (layout). The output of each chamber $F_i$ is connected
to the input of the associated chamber $C_i$. For all $(i, j) \in A$, there is a channel
from chamber $C_i$ to chamber $F_j$ and there is a pump that forces the flow of
liquid from $C_i$ to $F_j$.

In this way, the chambers and the channels of the microfluidic system codify 
the vertices and edges of the graph G. The orientation of the edges is codified with 
the direction of the flow in the associated channel. The graph is hard codified,
as opposed to Adleman’s soft (DNA) codification. One microfluidic system of this
type with $2n$ chambers can be reprogrammed (by opening or closing microchannels)
to codify the edges of any other graph with $n$ vertices.

Implementation. It is beyond the scope of this paper to give details of the 
possible implementation of these microsystems. We merely indicate that filter- 
ing and append operations are widely used in DNA computing and a detailed 
description of microfluidic devices is given in [11,20]. 

Working (dynamics) of the system. We assume that filtering and append
operations take one time unit and that in each time unit, the output of each $F_i$
is available as input to each $C_i$.

— Step 0: ($t = 0$) (pre-loading). Put into each chamber $F_j$ enough copies of the
strands associated with vertices $v_i$ for which there exists an edge $(i, j) \in A$.

— Steps 1 to $(n - 1)$: (from $t = 1$ to $t = n - 1$).

• Computations: In each time unit, for all $1 \le t \le n - 1$, a filtering
in all $F_i$ and an append operation in all $C_i$ are performed in parallel.

• Movement (pumping) of strands from step $t$ to step $t + 1$:
$\mathrm{Input}(F_j)(t+1) = \bigcup \mathrm{Output}(C_i)(t)$ for all $(i, j) \in A$.
The inlet of each chamber $F_j$ at time $t + 1$ is the union of the outlets of
chambers $C_i$ at time $t$ such that $(i, j) \in A$.

— Readout: After step $(n - 1)$, collect the strands in the output of chambers $C_i$.
If there are no output strands, the graph has no Hamiltonian paths. If there
are output strands in chamber $C_i$, there are Hamiltonian paths ending with
vertex $v_i$; we still need to determine the initial vertex of the strands in the
output of each chamber $C_i$. The initial vertex of each Hamiltonian path and
the exact sequence of vertices could be determined with a DNA sequencing
microchip.

2.3 Example 

We show an example of the execution of the algorithm for a graph with four
vertices. This graph has 3 Hamiltonian paths: 2143, 2134 and 1234. Figure 1 shows
the graph and the associated microfluidic device. Table 1 shows the contents of
each chamber in each step. Remember that
$\mathrm{Input}(F_j)(t+1) = \bigcup \mathrm{Output}(C_i)(t)$ for all $(i, j) \in A$.
In step 3 (time $t = 3 = n - 1$), we must check if there is any output
in any chamber $C_i$. In this case, for this graph, there are outputs in chambers
$C_3$ and $C_4$, so we know there are Hamiltonian paths ending at vertices $v_3$ and $v_4$,
respectively. To find out the first vertex and the exact order of the other vertices
in each path, we need to sequence these strands: paths 2143, 2134, and 1234 are
found.
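
The dynamics of the filtering and append chambers can be sketched in a few lines of Python. This is only an illustrative simulation, not an implementation of the microfluidic device: the function name `run_microfluidic_hpp` and the constant `EXAMPLE_ARCS` are hypothetical, strands are modelled as tuples of vertex indices, and the arc set is an assumption read off the pre-loading column ($t = 1$, rows $F_i$) of Table 1.

```python
def run_microfluidic_hpp(n, arcs):
    """Simulate chambers F_i (filter) and C_i (append) for n - 1 steps.

    Returns a dict mapping i to the set of strands at the outlet of C_i
    after step n - 1. Strands are tuples of vertex indices.
    """
    # Step 0 (pre-loading): F_j receives the one-vertex strand of every
    # v_i such that (i, j) is an arc.
    f_in = {j: {(i,) for (i, k) in arcs if k == j} for j in range(1, n + 1)}
    c_out = {}
    for _ in range(1, n):  # steps t = 1 .. n - 1
        # F_i retains strands that already contain v_i; C_i appends v_i
        # to every strand that passes the filter.
        c_out = {i: {z + (i,) for z in f_in[i] if i not in z} for i in f_in}
        # Pumping: the inlet of F_j at t + 1 is the union of the outlets
        # of all C_i with (i, j) in A.
        f_in = {j: set().union(*(c_out[i] for (i, k) in arcs if k == j))
                for j in f_in}
    return c_out


# Assumed arc set of the four-vertex example (inferred from Table 1).
EXAMPLE_ARCS = {(2, 1), (1, 2), (1, 3), (2, 3), (4, 3), (1, 4), (3, 4)}

outputs = run_microfluidic_hpp(4, EXAMPLE_ARCS)
hamiltonian_paths = set().union(*outputs.values())
```

Under these assumptions the only non-empty outlets after step 3 are those of $C_3$ and $C_4$, and the collected strands are exactly the paths 2143, 2134 and 1234 named in the text.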



        t = 1          t = 2               t = 3               t = 4
F1      {2}            {12}                {-}                 {-}
C1      {21}           {-}                 {-}                 {-}
F2      {1}            {21}                {-}                 {-}
C2      {12}           {-}                 {-}                 {-}
F3      {1, 2, 4}      {21, 12}            {214, 134, 234}     {2134, 1234}
C3      {13, 23, 43}   {213, 123}          {2143*}             {-}
F4      {1, 3}         {21, 13, 23, 43}    {213, 123, 143}     {2143}
C4      {14, 34}       {214, 134, 234}     {2134*, 1234*}      {-}

Table 1. Running of the algorithm. At time t = 3 we must read out the output
of the chambers Ci. Strands codifying Hamiltonian paths are marked with *.



Fig. 1. a) Graph with 4 vertices and b) microfluidic system associated with this
graph.



3 Final Remarks 

This article describes preliminary work on a DNA algorithm for the HPP, which,
like many others in the field of DNA computing (see [1,14]), has a linear execution
time but an exponential consumption of DNA molecules. However, this algorithm
runs autonomously in parallel without manual external intervention and does not
include complex operations with high error rates (PCR, electrophoresis, enzyme
restrictions, etc.). Additionally, we believe that it could be easily implemented in
microfluidic systems, bearing in mind the current state of this technology. Another
notable feature of the proposed bioalgorithm is that it follows a constructive
problem-solving strategy (like [18] but without complex operations), pruning the
unfeasible solutions (it only generates paths that do not have repeated vertices).
The details of a possible implementation of this algorithm will be examined in
a future paper.

However, we believe that other problem types (like sequence problems or
so-called DNA2DNA computation) are more appropriate and of greater biological
interest. With this in mind, we are working on designing an autonomous
and distributed DNA algorithm for solving the Shortest Common Superstring
problem (a known NP-complete problem, solved in the DNA computing field only
in [12], with a brute-force approach) using a conditional version of the parallel
overlapping assembly operation.

Another variant of the algorithm proposed here for the HPP can be used
to generate in linear time all the permutations of the $n$ integers $1, 2, \ldots, n$.
Each number $i$ is codified with a different single strand and has two associated
chambers $F_i$ and $C_i$. The pattern of connectivity is a complete graph, so, from
every chamber $C_i$, there is a channel to every chamber $F_j$, for all $1 \le i, j \le n$. In
$n - 1$ steps, the microsystem generates DNA strands codifying all permutations.
Again, our system generates only the correct strings because the algorithm fails
to extend strings with the same integer in at least two locations. This problem
is presented in [2] as a useful subprocedure for the DNA solution of several
NP-complete problems.
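
As a rough illustration of this permutation variant, the sketch below grows strands over a complete connectivity pattern. It is a simplified model under stated assumptions: strands are tuples of integers, the per-chamber bookkeeping is collapsed into one set of strands, and the name `grow_permutations` is hypothetical.

```python
from itertools import permutations


def grow_permutations(n):
    """Grow all repetition-free strands of length n in n - 1 steps."""
    # Pre-loaded one-symbol strands, one per integer 1..n.
    strands = {(i,) for i in range(1, n + 1)}
    for _ in range(n - 1):  # n - 1 filter-and-append steps
        # Complete connectivity: every strand reaches every F_j, which
        # blocks strands already containing j; C_j then appends j.
        strands = {z + (j,) for z in strands
                   for j in range(1, n + 1) if j not in z}
    return strands


result = grow_permutations(4)
```

For $n = 4$ the result coincides with the 24 tuples produced by `itertools.permutations`, since a strand is extended only by integers it does not yet contain.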

This paper also poses theoretical questions, as the parallel and distributed 
computing model on which it relies, as well as the operations it uses — conditional 
concatenation or filtering plus concatenation — are already present in theoretical 
distributed DNA computing models, like test tubes distributed systems, gram- 
mar systems theory, parallel finite automata systems and also tissue P systems. 

In particular, the redistribution of the contents of the test tubes according to 
specific filtering conditions (presence or absence of given subsequences) is present 
in the Test Tube Distributed Systems model [7,8] and, similarly, in Parallel 
Communicating (PC) Automata Systems [4] or [19]. However, our algorithm 
defines no master chamber. It also differs from PC grammar systems [6] because 
communication is not performed on request but is enacted automatically. 

The generative power of this new system with distributed conditional
concatenation remains to be examined. We make only a few preliminary comments.
In the HPP algorithm we use a very simple conditional concatenation: a symbol $a$ is
concatenated to a sequence $w$ if and only if $a \notin w$. In a similar way, we could
extend the restrictions to check the presence of specific suffixes, prefixes or
subwords as in [9]. This distributed conditional concatenation is a combination of
restricted evolution and communication (append restricted by filtering) that im- 
plicitly assumes the existence of regular/finite filters between the concatenation 
chambers. A similar study [10] was conducted for the test tube systems model 
but with context-free productions and regular filters. Looking at these results, 
we think our model probably needs a more powerful operation than conditional 
concatenation to be able to achieve computational completeness (like splicing 
[13] in the grammar systems area). 

Microfluidic systems seem to be an interesting and promising future support 
for many distributed DNA computing models (shift from Adleman/Lipton man- 
ual wet test tubes to DNA computing on surfaces to microflow DNA computing), 
and their full potential (the underlying computational paths could be a graph with
cycles that will allow, for example, the iterative construction and selection of the
solutions) needs to be thoroughly examined.



Abstract. We consider two types of languages defined by a string
through iterative factor duplications, inspired by the process of tandem
repeats production in the evolution of DNA. We investigate some de- 
cidability matters concerning the unbounded duplication languages and 
then fix the place of bounded duplication languages in the Chomsky hier- 
archy by showing that all these languages are context-free. We give some 
conditions for the non-regularity of these languages. Finally, we discuss 
some open problems and directions for further research. 



1 Introduction 

In recent years, some operations and generating devices based on duplication
operations have been introduced, motivated by considerations from
molecular genetics. It is widely accepted that DNA and RNA structures may
be viewed to a certain extent as strings; for instance, a DNA strand can be
presented as a string over the alphabet of the complementary pairs of symbols
(A, T), (T, A), (C, G), (G, C). Consequently, point mutations as well as large scale
rearrangements occurring in the evolution of genomes may be modeled as
operations on strings.

One of the most frequent and least well understood mutations among the
genome rearrangements is the gene duplication or the duplication of a segment
of a chromosome. Chromosomal rearrangements include pericentric and
paracentric inversions, intrachromosomal as well as interchromosomal transpositions,
translocations, etc. Crossover results in recombination of genes in a pair of
homologous chromosomes by exchanging segments between parental chromatids.
We refer to [3], [4], [5] and [19] for discussions on different formal operations on
strings related to the language of nucleic acids. This feature is also known in
natural languages. For motivations coming from linguistics, we refer to [11] and
[17].

In the process of duplication, a stretch of DNA is duplicated to produce two 
or more adjacent copies, resulting in a tandem repeat. An interesting property 
of tandem repeats is that they make it possible to do “phylogenetic analysis” on 
a single sequence which might be useful to determine a minimal or most likely 
duplication history. 

Several mathematical models have been proposed for the production of tan- 
dem repeats including replication, slippage and unequal crossing over [10,24,18]. 
These models have been supported by biological studies [20,?]. 

The so-called crossing over between “sister” chromatids is considered to be
the main way of producing tandem repeats or block deletions in chromosomes. In
[2], modeling and simulation suggest that very low recombination rates (unequal
crossing over) can result in very large copy number and higher order repeats.
A model of this type of crossing over has been considered in [6]. It is assumed
that every initial string is replicated so that two identical copies of every initial
string are available. The first copy is cut somewhere within it, say between the
segments $\alpha$ and $\beta$, and the other one is cut between $\gamma$ and $\delta$ (see Figure 1). Now,
the last segment of the second string gets attached to the first segment of the
first string, and a new string is obtained. More generally, another string is also
generated, by linking the first segment of the second string to the last segment
of the first string.
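
A minimal sketch of this crossing-over step on plain Python strings follows. The function name `unequal_crossover` and the convention of describing each cut by an index into the string are assumptions made for illustration only.

```python
def unequal_crossover(w, i, j):
    """Recombine two copies of w, cut at index i and index j respectively.

    The first copy is cut as w[:i] | w[i:], the second as w[:j] | w[j:].
    Returns the two recombinant strings.
    """
    # First segment of the first copy + last segment of the second copy.
    first = w[:i] + w[j:]
    # First segment of the second copy + last segment of the first copy.
    second = w[:j] + w[i:]
    return first, second


longer, shorter = unequal_crossover("abcde", 4, 2)
```

With `w = "abcde"`, `i = 4` and `j = 2`, the first recombinant is `"abcdcde"` (the factor `cd` now occurs twice in tandem) and the second is `"abe"` (the same factor is deleted), which matches the duplication and block-deletion behaviour described above.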



Figure 1: A scheme for gene duplication



It is easily seen that one obtains the insertion of a substring of $w$ in $w$; this has
the potential for inducing duplications of genes within a chromosome. We note
here that, although this model is inspired by recombination in vivo, it actually
makes use of splicing rules in the sense of [8], where a computational model based
on DNA recombination under the influence of restriction enzymes and ligases,
essentially in vitro, is introduced. This model turned out to be very attractive
for computer scientists; see, e.g., the chapter [9] in [16] and [15].

Based on [3], Martín-Vide and Păun introduced in [12] a generative mechanism
(similar to the one considered in [4]) based only on duplication: one starts
with a given finite set of strings and produces new strings by copying specified 
substrings to certain places in a string, according to a finite set of duplication 
rules. This mechanism is studied in [12] from the generative power point of view. 
In [14] one considers the context-free versions of duplication grammars, solves 
some problems left open in [12], proves new results concerning the generative 
power of context-sensitive and context-free duplication grammars, and compares 
the two classes of grammars. Context-free duplication grammars formalize the 
hypothesis that duplications appear more or less at random within the genome 
in the course of its evolution. 

In [7] one considers a string and constructs the language obtained by iter- 
atively duplicating any of its substrings. One proves that when starting from 
strings over two-letter alphabets, the obtained languages are regular; an answer 
for the case of arbitrary alphabets is given in [13], where it is proved that each 
string over a three-letter alphabet generates a non-regular language by duplica- 
tion. 

This paper continues this line of investigation. Many questions are still
unsolved; we list some of them, which appear more attractive to us - some of them
will be investigated in this work:

- Is the boundary of the duplication unique, is it confined to a few locations
or is it seemingly unrestricted?

- Is the duplication unit size unique, does it vary in a small range or is it 
unrestricted? 

- Does pattern size affect the variability of duplication unit size? 

- Does duplication occur preferentially at certain sites? 

In [7] the duplication unit size is considered to be unrestricted. We continue 
here with a few properties of the languages defined by unbounded duplication 
unit size and then investigate the effect of restricting this size within a given 
range. 

The paper is organized as follows: in the next section we give the basic
definitions and notations used throughout the paper. Then we present some
properties of the unbounded duplication languages based essentially on [7,13].
The fourth section is dedicated to bounded duplication languages. The main
results of this section are: 1. Each bounded duplication language is context-free.
2. Any square-free word over an at least three-letter alphabet defines a $k$-bounded
duplication language which is not regular for any $k \ge 4$. The paper ends with
a discussion on several open problems and directions for further research.

2 Preliminaries 

Now, we give the basic notions and notations needed in the sequel. For basic
formal language theory we refer to [15] or [16]. We use the following basic notation.
For sets $X$ and $Y$, $X \setminus Y$ denotes the set-theoretic difference of $X$ and $Y$. If $X$
is finite, then $card(X)$ denotes its cardinality; $\emptyset$ denotes the empty set. The set
of all strings (words) over an alphabet $V$ is denoted by $V^*$ and $V^+ = V^* \setminus \{\varepsilon\}$,
where $\varepsilon$ denotes the empty string. The length of a string $x$ is denoted by $|x|$,
hence $|\varepsilon| = 0$, while the number of all occurrences of a letter $a$ in $x$ is denoted
by $|x|_a$. For an alphabet $V = \{a_1, a_2, \ldots, a_k\}$ (we consider an ordering on $V$),
the Parikh mapping associated with $V$ is a homomorphism $\Psi_V$ from $V^*$ into
the monoid of vector addition on $\mathbb{N}^k$, defined by
$\Psi_V(s) = (|s|_{a_1}, |s|_{a_2}, \ldots, |s|_{a_k})$;
moreover, given a language $L$ over $V$, we define its image through the Parikh
mapping as the set $\Psi_V(L) = \{\Psi_V(x) \mid x \in L\}$. A subset $X$ of $\mathbb{N}^k$ is said to be
linear if there are vectors $c_0, c_1, c_2, \ldots, c_n \in \mathbb{N}^k$, for some $n \ge 0$, such that
$X = \{c_0 + \sum_{i=1}^{n} x_i c_i \mid x_i \in \mathbb{N}, 1 \le i \le n\}$. A finite union of linear sets is called
semilinear. For any positive integer $n$ we write $[n]$ for the set $\{1, 2, \ldots, n\}$.

Let $V$ be an alphabet and $X \in \{\mathbb{N}\} \cup \{[k] \mid k \ge 1\}$. For a string $w \in V^+$, we
set

$$D_X(w) = \{uxxv \mid w = uxv,\ u, v \in V^*,\ x \in V^+,\ |x| \in X\}.$$

We now define recursively the languages:

$$D_X^0(w) = \{w\}, \qquad D_X^i(w) = \bigcup_{x \in D_X^{i-1}(w)} D_X(x),\ i \ge 1,$$

$$D_X^*(w) = \bigcup_{i \ge 0} D_X^i(w).$$

The languages $D_{\mathbb{N}}^*(w)$ and $D_{[k]}^*(w)$, $k \ge 1$, are called the unbounded duplication
language and the $k$-bounded duplication language, respectively, defined by $w$. In
other words, for any $X \in \{\mathbb{N}\} \cup \{[k] \mid k \ge 1\}$, $D_X^*(w)$ is the smallest language
$L' \subseteq V^*$ such that $w \in L'$ and whenever $uxv \in L'$, $uxxv \in L'$ holds for all
$u, v \in V^*$, $x \in V^+$, and $|x| \in X$.
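
These definitions translate directly into a small enumeration procedure, sketched below. The helper names `duplicate_once` and `duplication_language` are hypothetical; because the languages are infinite, the second function only collects words up to a length bound, which is exact for that range since every duplication strictly increases the length.

```python
def duplicate_once(w, max_len=None):
    """D_X(w): all words uxxv with w = uxv and x nonempty.

    max_len = k restricts |x| to [k]; max_len = None means X = N.
    """
    n = len(w)
    limit = n if max_len is None else min(max_len, n)
    # Here u = w[:i], x = w[i:j], v = w[j:], so uxxv = w[:j] + w[i:j] + w[j:].
    return {w[:j] + w[i:j] + w[j:]
            for i in range(n)
            for j in range(i + 1, min(i + limit, n) + 1)}


def duplication_language(w, bound, max_len=None):
    """All words of D_X^*(w) whose length is at most `bound`."""
    seen, frontier = {w}, {w}
    while frontier:
        frontier = {y for x in frontier
                    for y in duplicate_once(x, max_len)
                    if len(y) <= bound} - seen
        seen |= frontier
    return seen


one_step = duplicate_once("ab")
upto4 = duplication_language("ab", 4)
```

A membership question such as "is $x \in D_X^*(y)$?" can then be answered for short words by testing `x in duplication_language(y, len(x), max_len)`; this brute-force search is meant only as an illustration of the definitions.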

A natural question concerns the place of unbounded duplication languages 
in the Chomsky hierarchy. In [7] it is shown that the unbounded duplication 
language defined by any word over a two-letter alphabet is regular, while [13] 
shows that these are the only cases when the unbounded language defined by a 
word is regular. By combining these results we have: 

Theorem 1. [7,13] The unbounded duplication language defined by a word w is 
regular if and only if w contains at most two different letters. 

3 Unbounded Duplication Languages 

We do not know whether or not all unbounded duplication languages are
context-free. A straightforward observation leads to the fact that all these languages are
linear, that is, the image of each unbounded duplication language through
the Parikh mapping is linear. Indeed, if $w \in V^+$, $V = \{a_1, a_2, \ldots, a_n\}$, then one
can easily infer that

$$\Psi_V(D_{\mathbb{N}}^*(w)) = \{\Psi_V(w) + \sum_{i=1}^{n} x_i e_i^{(n)} \mid x_i \in \mathbb{N},\ 1 \le i \le n\},$$

where $e_i^{(n)}$ is the vector of size $n$ having the $i$th entry equal to 1 and all the other
entries equal to 0.
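
As a small worked instance of this formula, take $w = ab$ over $V = \{a, b\}$ (so $n = 2$). The computation below is only an illustration; it assumes, as the formula implicitly does, that every letter of $V$ occurs in $w$.

```latex
% Worked instance for w = ab, V = {a, b}, n = 2.
\Psi_V(ab) = (1, 1), \qquad
\Psi_V\bigl(D_{\mathbb{N}}^*(ab)\bigr)
  = \{\, (1, 1) + x_1 (1, 0) + x_2 (0, 1) \mid x_1, x_2 \in \mathbb{N} \,\}
  = \{\, (p, q) \mid p \ge 1,\ q \ge 1 \,\}.
% Every vector is reached by duplicating single letters:
% ab => aab => aabb gives (2, 2), and in general
% a^{1 + x_1} b^{1 + x_2} \in D_{\mathbb{N}}^*(ab).
```

Duplications never remove letters, so no vector below $(1, 1)$ can occur, and single-letter duplications already reach every vector above it.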

Theorem 2. Given a regular language $L$ one can algorithmically decide whether
or not $L$ is an unbounded duplication language.

Proof. We denote by $alph(x)$ the smallest alphabet such that $x \in (alph(x))^*$.
The algorithm works as follows:

(i) We find the shortest string $z \in L$ (this can be done algorithmically). If there
are more strings in $L$ of the same length as $z$, then $L$ is not an unbounded
duplication language.

(ii) We now compute the cardinality of $alph(z)$.

(iii) If $card(alph(z)) \ge 3$, then there is no $x$ such that $L = D_{\mathbb{N}}^*(x)$.

(iv) If $card(alph(z)) = 1$, then $L$ is an unbounded duplication language if and
only if $L = \{a^{|z|+m} \mid m \ge 0\}$, where $alph(z) = \{a\}$.

(v) If $card(alph(z)) = 2$, $z = z_1 z_2 \ldots z_n$, $z_i \in alph(z)$, $1 \le i \le n$, then $L$ is an
unbounded duplication language if and only if

$$L = z_1^+ e_1 z_2^+ e_2 \ldots e_{n-1} z_n^+, \qquad (1)$$



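where

$$e_i = \begin{cases} \varepsilon, & \text{if } z_i = z_{i+1}, \\ (z_i + z_{i+1})^*, & \text{if } z_i \ne z_{i+1}, \end{cases}$$

for all $1 \le i \le n - 1$. Note that one can easily construct a deterministic finite
automaton recognizing the language in the right-hand side of equation (1).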

□ 

Theorem 3. 1. The following problems are algorithmically decidable for
unbounded duplication languages:

Membership: Given $x$ and $y$, is $x$ in $D_{\mathbb{N}}^*(y)$?

Inclusion: Given $x$ and $y$, does $D_{\mathbb{N}}^*(x) \subseteq D_{\mathbb{N}}^*(y)$ hold?

2. The following problems are algorithmically decidable in linear time:

Equivalence: Given $x$ and $y$, does $D_{\mathbb{N}}^*(x) = D_{\mathbb{N}}^*(y)$ hold?

Regularity: Given $x$, is $D_{\mathbb{N}}^*(x)$ a regular language?

Proof. Clearly, the membership problem is decidable and

$$D_{\mathbb{N}}^*(x) \subseteq D_{\mathbb{N}}^*(y) \text{ iff } x \in D_{\mathbb{N}}^*(y).$$

For $D_{\mathbb{N}}^*(x) = D_{\mathbb{N}}^*(y)$, it follows that $|x| = |y|$, hence $x = y$. In conclusion, $x = y$
iff $D_{\mathbb{N}}^*(x) = D_{\mathbb{N}}^*(y)$. This implies that the equivalence problem is decidable in
linear time.

The regularity can be decided in linear time by Theorem 1. □

4 Bounded Duplication Languages



Unlike the case of unbounded duplication languages, we are able to determine 
the place of bounded duplication languages in the Chomsky hierarchy. This is 
the main result of this section. 



Theorem 4. For any word $r$ and any integer $n \ge 1$, the $n$-bounded duplication
language defined by $r$ is context-free.

Proof. For our alphabet $V = alph(r)$ we define the extended alphabet $V_e$ by
$V_e := V \cup \{\langle a \rangle \mid a \in V\}$. Further we define $L^{\le l} := \{w \in L \mid |w| \le l\}$ for any
language $L$ and integer $l$. We now define the pushdown automaton



[Q,V,r,5, 



,4, 



where Q = 



and r = {_L} U < 



M G {Vf • U* U U* • V G w G (U*) 

’ M 



<M 






gQ,uG (U*)^",u>G (U*) 



i<kl 



Here we call the three strings occurring in a state from bottom to top pattern,
memory, and guess, respectively. Now we proceed to define an intermediate
deterministic transition function $\delta'$. In this definition the following variables
are always quantified universally over the following domains: u,v G (U*)-", 
w G PL G r, G (Vf • U* UU* • 7 G F, a: G U and y G Ue. 



(i) y 

(ii) y 
(hi) y 

(iv) y 



xw 

flXU 



,x,±j = 

,a;,7) = 



u{x) Y 



fi{x) 



_L I and S' 



XU 

xw 



6' 



■l{x) 



j = 

u{x) Y fii~\ 

XV I 



, 7 ) and S' 

uxY 



,e,-Lj = 

I = 



(IXU 

XV 



,4 

ft{x) 



,X,f] = 



, 7 j and 

uxY {I 






'n 

ux 



■l{x) 



= 



V 

uxv 



For all triples $(q, x, \gamma) \in Q \times (V \cup \{\varepsilon\}) \times \Gamma$ not listed above, we put $\delta'(q, x, \gamma) = \emptyset$.

To ensure that our finite state set suffices, we take a closer look at the memory
of the states - since after every reduction the reduced word is put there (see
transition set (iv)), there is a danger of this being unbounded. However, during
any reduction of a duplication, which in the end puts $k \le n$ letters into the
memory, either $2k$ letters from the memory or all letters of the memory (provided
the memory is shorter than $2k$) have been read, since reading from memory has
priority (note that the tape is read in states with empty memory only). This
gives us a bound on the length of words in the memory, which is $n$. It is worth
noting that the transitions of $\delta'$ actually match either the original word $r$ or the
guess against the input or memory. To obtain the transition function $\delta$ of our
automaton, we add the possibility to interrupt at any point the computation of
$\delta'$ and change into a state that starts the reduction of another duplication. To
this end we define for all states in $Q \setminus F$ and all $\gamma \in \Gamma$



S 





= S' 





U 





I G (^+) 




Such a transition guesses that at the current position a duplication of the
form $zz$ can be reduced (note that also here the bound of the length of the
memory is not violated). For the sake of understandability we first describe also
the function of the sets of transitions of $\delta'$. Transitions (i) match the input word
and $r$; this is only done on empty guess and stack, which ensures that every letter
is matched only after all duplications affecting it have been reduced. Sets (ii) and
(iii) check whether the guess, which is the segment of a guessed duplication, does
indeed occur twice adjacently in the memory followed by the input. This is done
by converting first guess letters from letters in $V$ to the corresponding letters in
$V_e$ (set (ii)) exactly if the respective letter is read from either memory or input.
Then set (iii) recovers the original letters. Finally the transitions in (iv) check
the last letter; if it also matches, then the duplication is reduced by putting only
one copy (two have been read) of the guess in the memory, and the computation
is continued by putting the string encoded in the topmost stack symbol back
into the guess. Now we can state an important property of $A_r^n$ which allows us
to conclude that it accepts exactly the language $D_{[n]}^*(r)$.



Property. If there is an accepting computation in A'f starting from the config- 
,v,a , then there is also an accepting computation starting from 



uration 
any configuration 



,V2,a where V = V1V2, \vi\ < n. 



Proof of the property. The statement is trivially true for $v_1 = \varepsilon$, therefore in the
rest of the proof we consider $v_1 \ne \varepsilon$. We prove the statement by induction on the
length of the computation. First we note that there exists a unique accepting



configuration that is 




Let n be an accepting computation in A'^ 



starting from the configuration 



If the length of $\Pi$ is one, then $\eta = \varepsilon$, $\alpha = \bot$, and $v = w \in V$, which makes
the statement obviously true. Let us assume that the statement is true for any
computation of length at most $p$ and consider a computation $\Pi$ of length $p + 1$
where we emphasize the first step. We distinguish three cases:



V 

£ 

W 




h 




, V, rja 



h* 





Case 1. 




304 Peter Leupold, Victor Mitrana, and Jose M. Sempere 



Clearly, we have also 



,V2,a h 



V2,ricx forany V1W2 = v, |vi| < n. 



By the induction hypothesis, 
Case 2. 

/r „i 

, xv', a h 



,V2,T)a h 



, e, _L ) holds as well. 



,v',a h 



, £, _L 



where v = xv' and the first step is based on a transition in the first part of the 
sets (i-iii). We have also 



,V2,a] h 



,V 2 ,a based on one of the 



corresponding transitions in the second part of the sets (i-iii). Since Vi = xv'i, 
by the induction hypothesis we are done in the second case. 

Case 3. 



v 

£ 

W 




h* 




where the first step is based on a transition in the first part of the set (iv) . Since 
reading from memory has priority, there exist w' , r, and [3 such that 



, v' , a' 



,u',/3) h 



, £, _L 





r/' 


\ 


/ 


r <^1 


( 


xv'-^ 

w 


,V2,aa 1 


^ ( 


1 



We have 



Further, following the same transitions as above, we get 



,V2,a' for any V1V2 = v, |ui| < n. 



yvi 



,V2,a' h 



V 2 , (3 j, and by the induction hypothesis we conclude the proof of the 
third case, and thus the proof of the property. 



After the elaborate definition of our pushdown automaton, the proof for
$L(A_r^n) = D_{[n]}^*(r)$ by induction on the number of duplications used to create
a word in $D_{[n]}^*(r)$ is now rather straightforward. We start by noting that the
original word $r$ is obviously accepted. Now we suppose that all words reached
from $r$ by $m - 1$ duplications are also accepted and look at a word $s$ reached by $m$
duplications. Clearly there is a word $s'$ reached from $r$ by $m - 1$ duplications such
that $s$ is the result of duplicating one part of $s'$. By the induction hypothesis,
$s'$ is accepted by $A_r^n$, which means that there is a computation, we call it $\Xi$,
which accepts $s'$. Let $s'[l \ldots k]$ (the subword of $s'$ which starts at the position $l$
and ends at the position $k$ in $s'$, $l \le k$) be the segment of $s'$ which is duplicated
at the final stage of producing $s$. Therefore

$$s = s'[1 \ldots k]\, s'[l \ldots |s'|] = s'[1 \ldots l-1]\, s'[l \ldots k]\, s'[l \ldots k]\, s'[k+1 \ldots |s'|].$$

Because $A_r^n$ reads in an accepting computation every input letter exactly once,
there is exactly one step in $\Xi$ where $s'[l]$ is read. Let this happen in a state
with $s'[l \ldots |s'|] = s[k+1 \ldots |s|]$ left on the input tape and $\alpha$ the stack contents.



Now we go without reading the input tape to the state with guess $s'[l \ldots k]$,
memory $\varepsilon$, and pattern $w$, and push $\mu$ onto
the stack. Then we reduce the duplication of the subword $s'[l \ldots k]$ of $s$, which is
the guess of the current state, by matching it twice against the letters read on the
input tape. After reducing this duplication, we arrive at a configuration having
a state containing $\mu$ in the guess, $s'[l \ldots k]$ in the memory, and $w$ in the pattern,
$s[k+1 \ldots |s|]$ left on the tape, and the stack contents as before. By the property
above, there exists an accepting computation starting with this configuration,
hence $s$ is accepted. Since all words accepted by $A_r^n$ are clearly in $D_{[n]}^*(r)$,
we are done. □

Clearly, any language $D_{[1]}^*(w)$ is regular. The same is true for any language
$D_{[2]}^*(w)$. Indeed, it is an easy exercise (we leave it to the reader) to check that

$$D_{[2]}^*(w) = (w[1]^+ w[2]^+)^* (w[2]^+ w[3]^+)^* \ldots (w[|w|-1]^+ w[|w|]^+)^*.$$

The following question appears in a natural way: Which are the minimal $k$ and
$n$ such that there are words $w$ over an $n$-letter alphabet such that $D_{[k]}^*(w)$ is not
regular?
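
For the simplest case, the regularity of $D_{[1]}^*(w)$ can be made concrete. The text does not spell out an expression for it; the sketch below assumes the observation that duplicating a single letter only lengthens a run of that letter, so that $D_{[1]}^*(w) = w[1]^+ w[2]^+ \ldots w[|w|]^+$. The function names are hypothetical.

```python
import re


def one_bounded_regex(w):
    """Regular expression for w[1]+ w[2]+ ... w[|w|]+ (assumed form of D_[1]^*(w))."""
    return re.compile("".join(re.escape(c) + "+" for c in w))


def in_one_bounded(w, s):
    """Test whether s is obtained from w by single-letter duplications only."""
    return one_bounded_regex(w).fullmatch(s) is not None


checks = (in_one_bounded("abc", "aabbbc"),
          in_one_bounded("abc", "abcabc"),
          in_one_bounded("aba", "aabba"))
```

The second check fails because `abcabc` needs a duplication of length 3, which is not available when only single letters may be copied.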



Theorem 5. 1. $D_{[k]}^*(w)$ is always regular for any word $w$ over a two-letter
alphabet and any $k \ge 1$.

2. For any word $w$ of the form $w = xabcy$ with $a \ne b \ne c \ne a$, $D_{[k]}^*(w)$ is not
regular for any $k \ge 4$.

Proof. 1. The equality $D_{\mathbb{N}}^*(w) = D_{[k]}^*(w)$ holds for any $k \ge 2$ and any word $w$
over a two-letter alphabet, hence the first item is proved.

2. Our reasoning for proving the second item, restricted without loss of
generality to the word w = abc, is based on a similar idea to that used in [13]. So,
let w = abc, V = {a, b, c}, and k ≥ 4. First we prove that for any u ∈ V^+ such
that wu is square-free, there exists v ∈ V* such that wuv ∈ D^*_{[k]}(w). We give
a recursive method for constructing wuv starting from w; at any moment we
have a string wu'v' where u' is a prefix of u and v' is a suffix of v. Initially,
the current string is w, which satisfies these conditions. Let us assume that we
reached x = wu[1]u[2] . . . u[i - 1]v' and we want to get y = wu[1]u[2] . . . u[i]v''.
To this end, we duplicate the shortest factor of x which begins with u[i] and
ends on the position 2 + i in x. Because wu[1]u[2] . . . u[i - 1] is square-free and
any factor of a square-free word is square-free as well, the length of this factor
which is to be duplicated is at most 4. Now we note that, given u such that wu
is square-free and v the shortest word such that wuv ∈ D^*_{[k]}(w), we have on
the one hand k|u| + 3 ≥ |wuv| (each duplication produces at least one symbol
of u), and on the other hand (k - 1)|v| ≥ |u| (each duplication produces at least
one symbol of v since wu is always square-free). Therefore,

(k - 1)|u| ≥ |v| ≥ |u|/(k - 1).        (2)
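
The passage from the two counting bounds to relation (2) can be spelled out as follows. This is only a short worked derivation, assuming |w| = 3 (since w = abc) and |wuv| = |w| + |u| + |v|:

```latex
\begin{align*}
k|u| + 3 \ge |wuv| = 3 + |u| + |v| \quad &\Longrightarrow \quad (k-1)|u| \ge |v|, \\
(k-1)|v| \ge |u| \quad &\Longrightarrow \quad |v| \ge \frac{|u|}{k-1}.
\end{align*}
```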
We are ready now to prove that D^*_{[k]}(w) is not regular using the Myhill-Nerode
characterization of regular languages. We construct an infinite sequence of
square-free words w_1, w_2, . . ., each of them being in a different equivalence class:

w_1 = w and w_{i+1} = wu such that wu is square-free and (k - 1)|w_i| < |u|/(k - 1).

Clearly, we can construct such an infinite sequence of square-free words since
there are arbitrarily long square-free words having the prefix abc [21,22,1]. For
instance, all words h^n(a), n ≥ 1, are square-free and begin with abc, where h is
an endomorphism on V* defined by h(a) = abcab, h(b) = acabcb, h(c) = acbcacb.
Let v_i be the shortest word such that w_i v_i ∈ D^*_{[k]}(w), i ≥ 1. By relation (2),
w_{i+1} v_j ∉ D^*_{[k]}(w) for any 1 ≤ j ≤ i. Consequently, D^*_{[k]}(w) is not regular. □
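
The claim about the words h^n(a) can be checked experimentally for small n. The sketch below is only an illustration: it assumes h(b) = acabcb, the value in Thue's classical square-free morphism, and the helper names (`apply_h`, `h_power`, `is_square_free`) are hypothetical, not taken from the paper.

```python
# Morphism assumed here: Thue's classical square-free morphism on {a, b, c}.
H = {"a": "abcab", "b": "acabcb", "c": "acbcacb"}


def apply_h(word: str) -> str:
    """Apply the morphism h letter by letter."""
    return "".join(H[ch] for ch in word)


def h_power(n: int, start: str = "a") -> str:
    """Return h^n(start), i.e. h applied n times."""
    word = start
    for _ in range(n):
        word = apply_h(word)
    return word


def is_square_free(s: str) -> bool:
    """Brute-force test: True if s has no factor of the form zz, z non-empty."""
    n = len(s)
    for i in range(n):
        for half in range(1, (n - i) // 2 + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                return False
    return True


if __name__ == "__main__":
    for n in range(1, 4):
        word = h_power(n)
        # Reports the length, whether the prefix is abc, and square-freeness.
        print(n, len(word), word.startswith("abc"), is_square_free(word))
```

The brute-force test is cubic in the word length, which is adequate only for the first few iterates.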

Since each square-free word over an at least three-letter alphabet has the 
form required by the previous theorem, the next corollary directly follows. 

Corollary 1. D^*_{[k]}(w) is not regular for any square-free word w over an alphabet
of at least three letters and any k ≥ 4.

5 Open Problems and Further Work 

We list here some open problems which will be in our focus of interest in the 
near future: 

1. Is any unbounded duplication language context-free? A way of attacking
this question, aiming at an affirmative answer, could be to prove that for
any word w there exists a natural number k_w such that D^*_ℕ(w) = D^*_{[k_w]}(w).
Note that a similar result holds for words w over alphabets with at most two letters.

2. What is the complexity of the membership problem for unbounded du- 
plication languages? Does the particular case when y is square-free make any 
difference? 

3. We define the X-duplication distance between two strings x and y, denoted
by Dupd_X(x, y), as follows:

Dupd_X(x, y) = min{k | x ∈ D^k_X(y) or y ∈ D^k_X(x)}, X ∈ {ℕ} ∪ {[n] | n ≥ 2}.

Clearly, Dupd is a distance. Are there polynomial algorithms for computing 
this distance? What about the particular case when one of the input strings is 
square-free? 

4. An X-duplication root, X ∈ {ℕ} ∪ {[n] | n ≥ 2}, of a string x is an
X-square-free string y such that x ∈ D^*_X(y). A string y is X-square-free if it does
not contain any factor of the form zz with |z| ∈ X. It is known that there are
words having more than one ℕ-duplication root. Then the following question is
natural: If x and y have a common duplication root and X is as above, then
D^*_X(x) ∩ D^*_X(y) ≠ ∅? We strongly suspect an affirmative answer. Again, the
case of at most two-letter alphabets is quite simple: Assume that x and y are
two strings over the alphabet V = {a, b} which have the same duplication root,
say aba (the other cases are identical). Then
x = a^{k_1}b^{p_1}a^{k_2} . . . b^{p_n}a^{k_{n+1}} and
y = a^{j_1}b^{q_1}a^{j_2} . . . b^{q_m}a^{j_{m+1}} for some n, m ≥ 1. It is an easy exercise to get
by duplication starting from x and y, respectively, the string (a^i b^i)^s a^i, where
s = max(n, m) and i = max(A), with

A = {k_t | 1 ≤ t ≤ n + 1} ∪ {p_t | 1 ≤ t ≤ n}
    ∪ {j_t | 1 ≤ t ≤ m + 1} ∪ {q_t | 1 ≤ t ≤ m}.

What is the maximum number of X-duplication roots of a string? How hard
is it to compute this number? Given n and X ∈ {ℕ} ∪ {[n] | n ≥ 2}, are there
words having more than n X-duplication roots? (the non-triviality property)
Given n, are there words having exactly n X-duplication roots? (the connectivity
property) Going further, one can define the X-duplication root of a given
language. What kind of language is the X-duplication root of a regular language?
Clearly, if it is infinite, then it is not context-free.
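
For very short strings, the duplication distance of Problem 3 can be explored by exhaustive search. The following sketch is not a polynomial algorithm and makes several assumptions: a duplication step copies one factor in place (x[i..j] becomes x[i..j]x[i..j]), `max_len=None` stands for the unbounded case, `max_len=n` for the bound [n], and the function names `duplications` and `dupd` are hypothetical.

```python
from typing import Optional, Set


def duplications(x: str, max_len: Optional[int] = None) -> Set[str]:
    """All strings obtained from x by duplicating one factor in place.

    If max_len is given, only factors of length <= max_len are duplicated.
    """
    out = set()
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            if max_len is not None and j - i > max_len:
                break
            out.add(x[:j] + x[i:j] + x[j:])
    return out


def dupd(x: str, y: str, max_len: Optional[int] = None) -> Optional[int]:
    """Least k such that the longer string is reached from the shorter one
    in k duplication steps; None if it is not reachable at all."""
    src, dst = (x, y) if len(x) <= len(y) else (y, x)
    frontier, seen, steps = {src}, {src}, 0
    while frontier:
        if dst in frontier:
            return steps
        nxt = set()
        for s in frontier:
            for t in duplications(s, max_len):
                # Duplication only lengthens strings, so prune by length.
                if len(t) <= len(dst) and t not in seen:
                    seen.add(t)
                    nxt.add(t)
        frontier, steps = nxt, steps + 1
    return None


if __name__ == "__main__":
    print(dupd("ab", "abab"))                # one step: duplicate "ab"
    print(dupd("aabb", "ab"))                # two single-letter duplications
    print(dupd("abc", "abcbc", max_len=1))   # unreachable with factors of length 1
```

Since every step increases the length by at least one, the search terminates, but the number of visited strings can grow exponentially with the length of the target.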
Abstract. We present a new proof that languages generated by (non
extended) H systems with finite sets of axioms and rules are regular.



1 Introduction 

Splicing is the basic combinatorial operation which DNA Computing [8] is based 
on. It was introduced in [4] as a formal representation of DNA recombinant 
behavior and opened new perspectives in the combinatorial analysis of strings, 
languages, grammars, and automata. Indeed, biochemical interpretations were 
found for concepts and results in formal language theory [11,8] and Molecular 
Computing emerged as a new field covering these subjects, where synergy be- 
tween Mathematics, Computer Science and Biology yields an exceptional stimu- 
lus for developing new theories and applications based on discrete formalizations 
of biological processes. In this paper we express the combinatorial form of splic- 
ing by four cooperating combinatorial rules: two cut rules (suffix and prefix dele- 
tion), one paste rule, and one (internal) deletion rule. This natural translation 
of splicing allows us to prove in a new manner the regularity of (non extended) 
splicing with finite sets of axioms and of rules. 

2 Preliminaries 

Consider an alphabet V and two symbols #, $ not in V. A splicing rule over V
is a string r = u_1#u_2$u_3#u_4, where u_1, u_2, u_3, u_4 ∈ V*. For a rule r and for any
x_1, x_2, y_1, y_2 ∈ V* we define the (ternary) splicing relation ⟹_r such that:

x_1u_1u_2x_2, y_1u_3u_4y_2 ⟹_r x_1u_1u_4y_2.

In this case we say that x_1u_1u_4y_2 is obtained by a splicing step, according to
the rule r, from the left argument x_1u_1u_2x_2 and the right argument y_1u_3u_4y_2.
The strings u_1u_2 and u_3u_4 are respectively the left and right splicing points or
splicing sites of the rule r. An H system, according to [8], can be defined by a
structure Γ = (V, A, R) where V is an alphabet, that is, a finite set of elements
called symbols, A is a set of strings over this alphabet, called axioms of the
system, and R is a set of splicing rules over this alphabet.

L_0(Γ) consists of the axioms of Γ. For n ≥ 0, L_{n+1}(Γ) consists of L_n(Γ) as
well as the strings generated by one splicing step from strings of L_n(Γ), by
applying all the rules in R that can be applied. The language L(Γ) generated by Γ is
the union of all languages L_n(Γ) for n ≥ 0. If a terminal alphabet T ⊂ V is
considered, and L(Γ) consists of the strings over T* generated by Γ, then we obtain
an extended H system Γ = (V, T, A, R). H systems are usually classified by means
of two classes of languages FL_1, FL_2: an H system is of type H(FL_1, FL_2) when
its axioms form a language in the class FL_1 and its rules, which are strings of
(V ∪ {#, $})*, form a language in the class FL_2; EH(FL_1, FL_2) is the subtype of
extended H systems of type H(FL_1, FL_2). We identify a type C = H(FL_1, FL_2)
of H systems with the class of languages generated by C. Let FIN, REG, RE
indicate the classes of finite, regular, and recursively enumerable languages
respectively. It is known that: H(FIN, FIN) ⊂ REG, H(REG, FIN) = REG,
EH(FIN, FIN) = REG, EH(FIN, REG) = RE. Comprehensive details can
be found in [8]. We refer to [11] and [8] for definitions and notations in formal
language theory.

3 Cut-and-Paste Splicing 

A most important mathematical property of splicing is that the class of languages
generated by a finite splicing, that is, by a finite number of splicing rules, from
a finite initial set of strings, is a subclass of regular languages: H(FIN, FIN) ⊂
REG and, more generally, H(REG, FIN) = REG. The proof of this result has a
long history. It originates in [1,2] and was developed in [9], in terms of a complex
inductive construction of a finite automaton. In [8] Pixton's proof is presented
(referred to as Regularity preserving Lemma). More general proofs, in terms of
closure properties of abstract families of languages, are given in [5,10]. In [6] a
direct proof was obtained by using ω-splicing. In this section we give another and
more direct proof of this lemma, as a natural consequence of a representation of
splicing rules.

Let Γ be an H system of alphabet V, with a finite number of splicing rules.
The language L(Γ) generated by an H system Γ can be obtained in the following
way. Replace every splicing rule

r_i = u_1#u_2$u_3#u_4

of Γ by the following four rules (two cut rules, a paste rule, a deletion rule) where
•_i and ⊙_i are symbols that do not belong to V, and variables x_1, x_2, y_1, y_2, x, y,
w, z range over the strings on V.

1. x_1u_1u_2x_2 ⟹_{r_i} x_1u_1•_i              (right cut rule)
2. y_1u_3u_4y_2 ⟹_{r_i} ⊙_iu_4y_2              (left cut rule)
3. xu_1•_i, ⊙_iu_4y ⟹_{r_i} xu_1•_i⊙_iu_4y     (paste rule)
4. w•_i⊙_iz ⟹_{r_i} wz                         (deletion rule)

If we apply these rules in all possible ways, starting from the axioms of Γ,
then the set of strings so generated that also belong to V* coincides with
the language L(Γ).
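
The decomposition of one splicing step into cut, paste, and deletion can be tried out on concrete strings. The sketch below is a minimal illustration for a single rule, under stated assumptions: a rule is encoded as a tuple (u1, u2, u3, u4), the characters "•" and "⊙" stand in for the bullet and antibullet of that rule, and all function names are hypothetical. It only compares one direct splicing step with one pass of the four rules; it does not compute the full closure.

```python
BULLET, ANTIBULLET = "•", "⊙"  # stand-ins for the two markers of one rule


def occurrences(s, site):
    """Start positions of all (possibly overlapping) occurrences of site in s."""
    return [i for i in range(len(s) - len(site) + 1) if s.startswith(site, i)]


def splice(x, y, rule):
    """Direct splicing step: x1 u1 u2 x2, y1 u3 u4 y2 => x1 u1 u4 y2."""
    u1, u2, u3, u4 = rule
    return {x[:i] + u1 + u4 + y[j + len(u3) + len(u4):]
            for i in occurrences(x, u1 + u2)
            for j in occurrences(y, u3 + u4)}


def right_cut(x, rule):
    """Rule 1: x1 u1 u2 x2 => x1 u1 BULLET."""
    u1, u2, _, _ = rule
    return {x[:i] + u1 + BULLET for i in occurrences(x, u1 + u2)}


def left_cut(y, rule):
    """Rule 2: y1 u3 u4 y2 => ANTIBULLET u4 y2."""
    _, _, u3, u4 = rule
    return {ANTIBULLET + u4 + y[j + len(u3) + len(u4):]
            for j in occurrences(y, u3 + u4)}


def paste(left, right, rule):
    """Rule 3: join a right-cut string with a left-cut string, keeping markers."""
    u1, _, _, u4 = rule
    if left.endswith(u1 + BULLET) and right.startswith(ANTIBULLET + u4):
        return left + right
    return None


def delete(s):
    """Rule 4: erase the adjacent bullet/antibullet pair."""
    return s.replace(BULLET + ANTIBULLET, "")


def splice_via_cut_and_paste(x, y, rule):
    out = set()
    for left in right_cut(x, rule):
        for right in left_cut(y, rule):
            joined = paste(left, right, rule)
            if joined is not None:
                out.add(delete(joined))
    return out


if __name__ == "__main__":
    rule = ("a", "b", "", "ab")  # illustrative rule a#b$λ#ab
    print(splice("dbab", "dbab", rule))
    print(splice_via_cut_and_paste("dbab", "dbab", rule))
```

Both calls should return the same set, which is the point of the four-rule representation.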

This representation of splicing has a very natural biochemical reading [3].
Actually, the first two rules are an abstract formulation of the action of
restriction enzymes, where the symbols •_i and ⊙_i correspond to the sticky ends (here
complementarity is between •_i and ⊙_i). The third rule is essentially the
annealing process that joins strands with matching sticky ends but leaves a hole
(represented by the string •_i⊙_i) in the phosphodiester bond between the 5'
and 3' loci. The final rule expresses the hole repair performed by a ligase enzyme. The
proof that will follow is inspired by this analysis of the splicing mechanism and
develops an informal idea already considered in [7].

Theorem 1. If Γ is an H system with a finite number of axioms and rules, then
L(Γ) is regular.

Proof. Let r_1, r_2, . . . , r_n be the rules of Γ. For each rule r_i = u_1#u_2$u_3#u_4,
introduce two new symbols •_i and ⊙_i, for i = 1, . . . , n, which we call bullet and
antibullet of the rule r_i. If u_1u_2 and u_3u_4 are the left and the right splicing sites
of r_i, then symbols •_i and ⊙_i can be used in order to highlight the left and right
splicing sites that occur in a string. More formally, let h be the morphism

h : (V ∪ {•_i | i = 1, . . . , n} ∪ {⊙_i | i = 1, . . . , n})* → V*

that coincides with the identity on V and erases all the symbols that do not
belong to V, that is, associates the empty string λ to them. In the following
V' will abbreviate (V ∪ {•_i | i = 1, . . . , n} ∪ {⊙_i | i = 1, . . . , n})*. For any rule
r_i = u_1#u_2$u_3#u_4, with i = 1, . . . , n, we say that a string of V' is •_i-factorizable
if it includes a substring u'_1u'_2 ∈ V' such that h(u'_1) = u_1, h(u'_2) = u_2. In this
case the relative •_i-factorization of the string is obtained by replacing u'_1u'_2 with
u'_1•_iu'_2. Analogously, a string of V' is ⊙_i-factorizable if it includes a substring
u'_3u'_4 ∈ V' such that h(u'_3) = u_3, h(u'_4) = u_4. In this case the relative
⊙_i-factorization of the string is obtained by replacing u'_3u'_4 with u'_3⊙_iu'_4. A string α
is a maximal factorization of a string η if α is a factorization such that h(α) = η,
α contains no two consecutive occurrences of the same bullet or antibullet, while
any further factorization of α contains two consecutive occurrences of the same
bullet or antibullet. It is easy to verify that the maximal factorization of a string
is unique.

Now, given an H system, we factorize its axioms in a maximal way. Let
α_1, α_2, . . . , α_m be these factorizations and let ▷α_1▷, ▷α_2▷, . . . , ▷α_m▷ be their
extensions with a symbol ▷ marking the start and the end of these factorizations.
From the set Φ of these factorization strings we construct the following labeled
directed graph G_0, which we call axiom factorization graph of the system, where:

1. A node is associated to each occurrence of a bullet or antibullet that occurs
in a string of Φ, while a unique entering node is associated to all the
occurrences of symbol ▷ at the beginning of the strings of Φ, and a unique exiting
node •▷ is associated to all the occurrences of symbol ▷ at the end of the strings
of Φ.

2. From any node n there is an arc to the node m if the occurrence to which
m is associated is immediately after the occurrence to which n is associated;
this arc is labeled with the string between these two occurrences.

3. Each bullet is linked by an arc with the empty label to the antibullet of the
same index.

As an example, let us apply this procedure to the following H system
specified by:

Alphabet: {a, b, c, d},

Axioms: {dbab, cc},

Rules: {r_1 : a#b$λ#ab, r_2 : baa#aaa$λ#c}.




Fig. 1. An Axiom Factorization Graph 



In Figure 1 the graph G_0 of our example is depicted. Hereafter, unless it is
differently specified, by a path we understand a path going from the entering
node to the exiting node. If one concatenates the labels along a path, one gets
a string generated by the given H system. However, there are strings generated
by the system that are not generated as (concatenation of labels of) paths of
this graph. For example, the strings dbaac, dbaacc do not correspond to any path
of the graph above. In order to overcome this inadequacy, we extend the graph
that factorizes the axioms by adding new possibilities of factorizations, where
the strings u_1 and u_4 of the rules are included and other paths with these labels
are possible.
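
That dbaac and dbaacc are indeed generated by the example system can be confirmed by iterating splicing steps on the axioms. The sketch below assumes the rules read r_1 = a#b$λ#ab and r_2 = baa#aaa$λ#c (with λ the empty string), encodes each rule as a tuple (u1, u2, u3, u4), and uses hypothetical function names. The function `level(axioms, rules, n)` computes the finite set of strings obtained after at most n rounds of splicing, not the whole (infinite) language.

```python
def occurrences(s, site):
    """Start positions of all (possibly overlapping) occurrences of site in s."""
    return [i for i in range(len(s) - len(site) + 1) if s.startswith(site, i)]


def splice(x, y, rule):
    """One splicing step: x1 u1 u2 x2, y1 u3 u4 y2 => x1 u1 u4 y2."""
    u1, u2, u3, u4 = rule
    return {x[:i] + u1 + u4 + y[j + len(u3) + len(u4):]
            for i in occurrences(x, u1 + u2)
            for j in occurrences(y, u3 + u4)}


def level(axioms, rules, n):
    """Strings obtained from the axioms by at most n rounds of splicing."""
    lang = set(axioms)
    for _ in range(n):
        new = set()
        for rule in rules:
            for x in lang:
                for y in lang:
                    new |= splice(x, y, rule)
        lang |= new
    return lang


AXIOMS = {"dbab", "cc"}
RULES = [("a", "b", "", "ab"),      # r_1, as assumed above
         ("baa", "aaa", "", "c")]   # r_2, as assumed above

if __name__ == "__main__":
    for n in range(6):
        current = level(AXIOMS, RULES, n)
        print(n, "dbaac" in current, "dbaacc" in current)
```

Under these assumptions r_1 pumps the block of a's one letter per round, and r_2 becomes applicable only once a string containing baaaaa has been produced, which is why the two strings appear only after several rounds.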

Before going on, let us sketch the main intuition underlying the proof, which
should appear completely clear when the formal details are developed. The axiom
factorization graph suggests that all the strings generated by a given H system
are combinations of: i) strings that are labels of the axiom factorization graph,
and ii) strings u_1 and u_4 of splicing rules of the system. Some combinations of
these pieces can be iterated, but, although paths may include cycles, there is only
a finite number of ways to combine these pieces in paths going from the entering
node to the exiting node. An upper bound on this number is determined by: i)
the number of substrings of the axioms and of the rules, and ii) the number of
factorization nodes that can be inserted in a factorization graph when extending
the axiom factorization with the rules, as will be shown in the proof.

The detailed construction of the proof is based on two procedures that expand
the axiom factorization graph G_0. We call the first procedure Rule expansion
and the second one Cross bullet expansion. In the following, for the sake of
brevity, we identify paths with factorization strings.

3.1 Rule Expansion 

Let G_0 be the graph of the maximal factorizations of the axioms of Γ. Starting
from G_0, we generate a new graph G_e which we call rule expansion of G_0.
Consider a symbol ⊗ which we call cross bullet. For this symbol h is assumed
to be a deleting function, that is, h(⊗) = λ. For every rule r_i = u_1#u_2$u_3#u_4
of Γ, add to G_0 two rule components: the u_1 component that consists of a pair
of nodes ⊗ and •_i with an arc from ⊗ to •_i labeled by the string u_1, and the
u_4 component that consists of a pair of nodes ⊗, ⊙_i with an arc from ⊙_i to ⊗
labeled by the string u_4.

Then, add arcs with the empty label from the new node •_i to the corresponding
antibullet nodes ⊙_i that were already in the graph. Analogously, add arcs
with the empty label from the nodes •_i that were already in the graph to the
new antibullet node ⊙_i.



© 

U4 

4 

1 

Fig. 2. Rule i Expansion Components 

In the case of the graph in Figure 1, if we add the rule expansions of the two 
rules of the system, then we get the graph of Figure 2. 




3.2 Cross Bullet Expansion 

Now we define the cross bullet expansion procedure. Consider a symbol o, which 
we call empty bullet. For this symbol, h is assumed to be a deleting function, 
that is, h{o) = A. Suppose that in Gg there is a cycle Qj — •j- A cycle introduces 
new factorization possibilities that are not explicitly present in the graph, but 
that appear when we go around the cycle a certain number of times. Let us 
assume that, by iterating this cycle, a path 9 is obtained that generates a new 
splicing site for some rule. We distinguish two cases. In the first case, which 
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Fig. 3. The Rule Expansion of the Graph of Figure 1 



we call 1-cross bullet expansion, the path 9, considered as a string, includes the 
splicing site U\U 2 of a rule 

9 = 

for some ^ beginning with a symbol of V such that h{^) = UiU 2 - In the second 
case, which we call 4^-cross bullet expansion, the path 9, considered as a string, 
includes the splicing site M 3 M 4 of a rule 

9 = 77 C /3 

for some ^ ending with a symbol of V such that h(^) = U3U4. As it is illustrated 
in the following pictures, the positions where the beginning of ^ or the end of ^ 
are respectively located could be either external to the cycle or internal to it. In 
both cases we insert an empty bullet in the path 9. 

— In the case of a 1-cross bullet expansion, we insert an empty bullet o, exactly 
before with an arrow going from this empty bullet to the cross bullet of the 
Ml expansion component of Let 1/t be the type of this empty bullet. This 
expansion is performed unless rj does not already include an empty bullet of 
type 1 ji after its last symbol of V. 

— In the cases of a 4-cross bullet expansion, we insert an empty bullet o, exactly 
after with an arrow, entering this empty bullet and coming from the cross 
bullet of the M 4 expansion component of r^. Let 4/z be the type of this 
empty bullet. This expansion is performed unless P does not already include 
an empty bullet of type 4/z before its first symbol of V. 

The two cases are illustrated in the following pictures (in 1-cross bullet expansion 
the beginning of the splicing site is external to the cycle, in the 4-cross bullet 
expansion the end of the splicing point is internal to the cycle) . 

The general situation of cross bullet expansion is illustrated in Figure 3. 
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7 I cycle 

, 

3 a 3 






Ml 




Fig. 4. 1-Cross bullet Expansion: 7 a” includes a string of h ^{uiU 2 ) as its prefix 



cycle 




Fig. 5. 4-Cross bullet Expansion: a” includes a string of h ^{usU 4 ) as its suffix 



If we apply cross bullet expansion to the graph of Figure 3 we get the graph 
of Figure 7 where only one 1-cross bullet expansion was applied. The following 
lemma establishes the termination of the cross bullet expansion procedure. 

Lemma 1. If, starting from G_e, we apply again and again the cross bullet
expansion procedure, then the resulting process eventually terminates, that is, after
a finite number of steps we get a final graph where no new cycles can be
introduced.

Proof. This holds because in any cross bullet expansion we insert an empty
bullet ∘ and an arc with the empty label connecting it to a cross node, but empty
bullets, at most one for each type, are always inserted between two symbols of
V starting from the graph G_e, which is fixed at the beginning of the cross bullet
expansion process. Therefore, only a finite number of empty bullets can be
inserted. This ensures that the expansion process starting from G_e will eventually
stop.

Let G be the completely expanded graph. The cross bullet expansion procedure
is performed by a (finite) sequence of steps. At each step an empty bullet is
inserted and an arc is added that connects the empty bullet with a cross bullet.

Fig. 7. Cross bullet Expansion of the Graph of Figure 3



Let L(G) be the language of the strings generated by paths of G. The inclusion
L(G) ⊆ L(Γ) can be easily shown by induction on the number of cross bullet
expansion steps: obviously L(G_0) ⊆ L(Γ), thus, assume that all the paths of
the graph at step i generate strings of L(Γ). Then the paths in the expanded
graph at step i + 1 generate strings spliced from paths which are present at step
i. Therefore, the inclusion holds for the completely expanded graph.

For the inverse inclusion we need the following lemma, which follows directly
from the method of the cross bullet expansion procedure.

Lemma 2. In the completely expanded graph G, when there is a path ηθσρ where
h(θ) = u_1, h(σ) = u_2 and u_1u_2 is the splicing site of the rule r_i, then also a path
ηθ'•_i occurs in G with h(θ) = h(θ'). Analogously, if in G there is a path ηθσρ
where h(θ) = u_3, h(σ) = u_4 and u_3u_4 is the splicing site of the rule r_i, then also
a path ⊙_iσ'ρ occurs in G with h(σ) = h(σ').

The inclusion L(Γ) ⊆ L(G) can be shown by induction on the number of
splicing steps. If η is an axiom, the condition trivially holds. Assume that η
derives, by means of a rule r_i, from two strings. This means that these strings
can be factored as:

α•_iβ
γ⊙_iδ

and, by the induction hypothesis, in G there are two paths θ, ρ generating αβ
and γδ, respectively. These paths include as sub-paths h^{-1}(u_1u_2), h^{-1}(u_3u_4),
respectively, therefore, according to the previous lemma, a path σ•_i is in G
where h(σ) = α and a path ⊙_iπ is in G where h(π) = δ. This means that the
path σ•_i⊙_iπ is in G, but h(σ•_i⊙_iπ) = η, therefore η is generated by a path
of the completely expanded graph. In conclusion, L(Γ) = L(G). The language
L(G) is regular because it is easy to define it by means of a regular expression
deduced from G.
The first time I met Professor Tom Head, just in the year 1971, was when
I became interested in the triangle linguistics-molecular genetics-mathematics. 
Tom Head is a pioneer in the theory of splicing, which became an important 
component of the new domain of DNA computing ([5,14,7]). It is nice for me to 
remember that in December 1971 I had the chance to be invited by Professor
Head to the University of Alaska, where I gave a series of lectures. A few months
earlier, in July-August 1971, I organized within the framework of the Linguistic 
Institute of America (at State University of New York at Buffalo, under the 
direction of Professor David Hays) a research seminar on the above mentioned 
triangle. My starting point for the seminar was a book by Professor Z. Pawlak, 
a famous Polish computer scientist, and the writings of Roman Jakobson on the 
link between linguistics and molecular genetics. One of the participants at this 
Seminar was Bernard Vauquois, one of the initiators of ALGOL 60. As a result of 
this seminar, I published the article [10] and much later [11], where the interplay 
nucleotide bases-codons-amino acids-proteins is analyzed in the perspective of 
structural linguistics and of formal language theory. 

In retrospect, the notion of splicing was in the immediate neighbourhood
of the ideas investigated there, but we were not able to invent it and to give it an
explicit status. It was the privilege of Tom Head [5,6] to identify it and to show its
generative capacity. In further steps, [14,7,16,4] developed a comprehensive
theory of DNA computing, based on the splicing operation. However, concomitantly
there was another line of development where splicing was not involved, although 
potentially it is implied and its relevance can be shown: the Human Genome 
Project (HGP). It started around 1990 and was directed towards sequencing the 
DNA, identifying the genes and establishing the gene-protein function correla- 
tion. This line of research brings to the center of attention the analytic aspects 
related to DNA, genes and codons. Obviously, in the context of Watson-Crick
double strand organization, splicing, as a purely combinatorial-sequential op- 
eration related to DNA, finds its natural place and should be involved in the 
gene-protein interaction. But the aim of this note is more modest and of a very 
preliminary nature in respect to the problem we just mentioned. We have in 
view the so-called duality of patterning principle and its possible relevance for 
molecular genetics. This aspect is not without relation with the gene-protein 
interaction, because the duality is in strong relation with the arbitrariness of the 
genetic sign, a genetic equivalent of the problem of arbitrariness of the linguistic 
sign, discussed by Ferdinand de Saussure. 
The duality of patterning principle (for short: duality), discussed for the first
time by Martinet [12,13] and called in French ‘principe de la double articulation’,
asserts the existence in language of two levels of structural organization, phono- 
logical and grammatical, such that the segments of the latter level are composed 
of segments of the former level, called phonemes (see also [9] pages 71-76). There 
is here no reference to meaning and no reference to the quantitative aspect of 
the units involved in the respective levels. However, phonemes are meaningless 
units while forms of the higher level are meaningful units (the meaning may be 
lexical or grammatical). Moreover, there is a discrepancy between the relatively 
small number of units involved in the lower level and the relatively large number 
of units involved in the higher level. As in many other situations, the principle 
of least effort seems to be involved here too: try to obtain as many as possible 
lexical and grammatical forms by using as few as possible elements of the lower 
level (the higher level corresponds to the first articulation, while the lower level 
corresponds to the second articulation). Indeed, the number of phonemes is
usually between ten and a hundred, while the number of words (or of morphemes) is
usually larger than ten thousand.

When Martinet [12,13] claims that duality is a specific feature of natural 
languages, things become very provocative. In other words, no other semiotic 
system, particularly no artificial language is endowed with the duality of pat- 
terning. In contrast with this claim, researchers in molecular genetics who trusted 
the language status of DNA, RNA and proteins claimed just the opposite:
duality is an important aspect of molecular genetics or, as Ji calls it [8], of the
cell language.

In [10,11], I gave a detailed presentation of various proposals of interpreta- 
tion of duality in molecular genetics and also I gave my interpretation, arguing 
in its favor and against other interpretations. In the meantime, other proposals 
were published, for instance, those of Collado-Vides [2,3], of Bel-Enguix [1] and 
of Ji [8]. Our interpretation was to assimilate the first articulation with the level 
of codons and the second articulation with the level of nucleotide bases. This 
means that the former are the morphemes of the genetic language, while the 
latter are its phonemes. The genetic phonemes are endowed with chemical sig- 
nificance, while the genetic morphemes are endowed with biological significance 
(because they encode, via the genetic dictionary, some amino acids or some
‘orthographic’ indication, such as ‘start’ or ‘stop’). Ji ([8] page 155) proposes a unique
framework for what he calls “human and cell languages” ^ under the form of a
6-tuple L = (A, W, S, G, P, M), where A is the alphabet, W is the vocabulary
or lexicon (i.e., a set of words), S is an arbitrary set of sentences, G is a set of
rules governing the formation of sentences from words (the first articulation, in 
Ji’s interpretation), as well as the formation of words from letters (the second 
articulation), P is a set of mechanisms realizing and implementing a language, 
and finally M is a set of objects (both mental and material) or processes referred 
to by words and sentences. For Ji, the first articulation of the cell language is 

^ As a matter of fact, for natural and genetic languages; there are many human lan- 
guages other than the natural ones, for instance all artificial languages. 
identified with the spatio-temporal organization of gene expressions through the
control of DNA folding patterns via conformational (or noncovalent) interac- 
tions, while the second articulation is identified with the linear arrangement of 
nucleotides to form structural genes through covalent (or configurational) inter- 
actions. Ji rediscovers another duality at the level of proteins, where the second 
articulation is identified with covalent structures, while the first articulation is 
identified with the three-dimensional structures of polypeptides formed through 
conformational (or noncovalent) interactions. As it can be seen, for Ji there are 
a total of four articulations of the genetic language and the condition of linearity 
and sequentiality of a language structure is no longer satisfied. The concept of a 
language as a special sign system is in question, and it is no longer clear where 
is the border, if there exists such a border, between an arbitrary sign-system 
and a language-like sign system. Both Martinet [13] and Lyons ([9] page 74-76) 
stress the link between duality and arbitrariness. Duality points to some
limits of the arbitrariness of the linguistic sign. Can we speak, equivalently, of
the arbitrariness of the “genetic sign”? To what extent is heredity the result of 
the combinatorial game expressed by the first articulation, under the limits and 
the freedom imposed by the second articulation? 
Abstract. We introduce four new variants of P systems, which we call
non-standard because they look rather “exotic” in comparison with sys- 
tems investigated so far in the membrane computing area: (1) systems 
where the rules are moved across membranes rather than the objects 
processed by these rules, (2) systems with reversed division rules (hence 
entailing the elimination of a membrane when a membrane with identical
contents is present nearby), (3) systems with accelerated rules (or
components), where any step except the first one takes half of the time 
needed by the previous step, and (4) reliable systems, where, roughly 
speaking, all possible events actually happen, providing that “enough” 
resources exist. We only briefly investigate these types of P systems, the 
main goal of this note being to formulate several related open problems 
and research topics. 



1 Introduction 

This paper is addressed to readers who are already familiar with membrane 
computing - or who are determined to become familiar from sources other than 
a standard prerequisites section - which will not be present in this paper. Still, 
we repeat here an often-used phrase: “A friendly introduction to membrane
computing can be found in [5], while comprehensive details are provided by [4], 
or can be found at the web address http://psystems.disco.unimib.it.” 

Of course, calling “exotic” the classes of P systems we are considering here is a 
matter of taste. This is especially true for the first type, where the objects never 
change the regions, but, instead, the rules have associated target indications 
and can leave the region where they are used. Migratory rules are 
reminiscent of the fact that in a cell the reactions are catalyzed/driven/controlled 
by chemicals which can move through membranes like any other chemicals - the 
difference is that here the “other chemicals” are not moving at all. Actually, P 
systems with moving rules were already investigated, by Rudi Freund and his 
co-workers, in a much more general and flexible setup, where both the objects 
and the rules can move (and where universality results are obtained - references 
can be found in [4]). 

We find the functioning of such systems pretty tricky (not to say difficult), 
and only a way to generate the Parikh sets of matrix languages (generated by 
grammars without appearance checking) is provided here. Maybe the reader will 
prove - or, hopefully, disprove - that these systems are universal. 

The second type of “exotic” systems starts from the many-times-formulated 
suggestion to consider P systems where for each rule $u \to v$ one also uses the 
rule $v \to u$. This looks strange in the case of membrane dividing rules, 
$[_i\, a]_i \to [_i\, b]_i [_i\, c]_i$, where we have to use the reversed rule 
$[_i\, b]_i [_i\, c]_i \to [_i\, a]_i$, with the 
meaning that if the two copies of membrane $i$ differ only in the objects $b$, $c$, then 
we can remove one of them, replacing the distinguished object of the remaining 
membrane by $a$. This is a very powerful operation - actually, we have no precise 
estimation of “very powerful”, but only the observation that in this way we can 
compute in an easy manner the intersection of two sets of numbers/vectors which 
are computable by means of P systems with active membranes. 

The third idea is to imagine P systems where the rules “learn from experience”, 
so that the application of any given rule takes one time unit at the first 
application, half of this time at the second application, and so on - the $(i+1)$th 
application of the rule takes half of the time requested by the $i$th application. 
In this way, arbitrarily many applications of a rule - even infinitely many! - 
take at most two time units... Even a system which never stops (that is, is able 
to continuously apply rules) provides a result after a known number of external 
time units. Note this important difference, between the external time, measured 
in “physical” units of equal length, and the computational time, dealing with 
the number of rules used. When the rules are to be used in parallel, delicate 
synchronization problems can appear, because of the different time interval each 
rule can request, depending on how many times each rule has already been used 
in the computation. 

However, characterizations of Turing computable sets of numbers are found by 
means of both cooperative multiset rewriting-like rules and antiport rules of 
weight 2, in each case with accelerated rules. Clearly, the computation of infinite sets 
is done in a finite time. Universality reasons show that all Turing computable 
sets of numbers can be computed in a time bounded in advance. A concrete 
value of this bound remains to be found, and this seems to be an interesting 
(mathematical) question. 

Of course, the acceleration can be considered not only at the level of rules, but 
also at the level of separate membranes (the first transition in such a membrane 
takes one time unit, the second one takes half of this time, and so on), or directly 
at the level of the whole system. These cases are only mentioned and left as a 
possible research topic (for instance, as a possible way to compute “beyond 
Turing” - as it happens with accelerated Turing machines, see [1], [2], [3]). 

The same happens with the idea to play a similar game with the space 
instead of the time. What about space compression (whatever this could mean) 
or expansion (idem)? The universe itself is expanding, so let us imagine that in 
some regions of a P system the objects (and membranes) are doubled from time 
to time, just like that, without applying individual rules - still, some rules for 
controlling this global doubling should be considered. Any quantum computing 
links? Especially: any application in “solving” hard problems in a fast way? 

Finally, the fourth idea originates in the observation that there is a striking 
difference between classic computer science (complexity theory included), based 
on (assumed) deterministic functioning of computers, and bio-computations, as 
carried out in DNA computing and as imagined in membrane computing: if we have 
“enough” molecules in a test tube, then all possible reactions will actually hap- 
pen. This is far from a mathematical assertion, but many DNA computing ex- 
periments, including the history-making one by Adleman, are based on such 
assumptions (and on the mathematical fact that a positive solution can be ef- 
fectively checked in a feasible time). The belief is that reality arranges things 
around the average, it does not deal with worst cases; probabilistically stated, 
“the frequency tends to the probability”. If something has a non-zero probability, 
then eventually it will happen. We just have to provide “enough” possibilities 
for it to happen (attempts, resources, etc). 

Here, we formulate this optimistic principle in the following way: if we have 
“enough” objects in a membrane, then all rules are applied in each transition. 
The problem is to define “enough”... and we take again an optimistic decision: 
if each combination of rules has polynomially many chances with respect to the 
number of combinations itself, then each combination is applied. This is especially 
suitable (and clear) for the case of string-objects: if we have $n$ combinations 
of rules which can be applied to an existing string, then each combination will 
be applied to at least one string, provided that we have at least $n \cdot p(n)$ copies 
of the string available, for some polynomial $p(x)$ specific to the system we deal 
with. A P system having this property is said to be reliable. 

Clearly, a deterministic system is reliable, and the polynomial $p(x)$ is the 
constant, $p(n) = 1$, so of more interest is to assume this property for nondeterministic 
P systems. Such systems are able to solve NP-complete problems in 
polynomial time. We illustrate this possibility for SAT, which is solved in linear 
time by means of reliable systems with string-objects able to replicate strings of 
length one (in some sense, we start with symbol-objects processed by rules of 
the form $a \to bb$, we pass to strings by rules $b \to w$, and then we process the 
string-objects $w$ and the strings derived from them, as usual in P systems with 
string-objects). 



2 Moving Rules, Not Objects 

We consider here only the symbol-object case, with multiset rewriting-like rules. 
After being applied, the rules are supposed to move through membranes; this 
migration is governed by metarules of the form $(r, tar)$, existing in all regions, 
with the meaning that the multiset-processing rule $r$, if present in the region, 
after being used has to go to the region indicated by $tar$; as usual, $tar$ can be 
one of here, out, in. 

More formally, a P system with migrating rules is a construct 

$$\Pi = (O, C, T, R, \mu, w_1, \ldots, w_m, R_1, \ldots, R_m, D_1, \ldots, D_m, i_o),$$



where: 

1. $O$ is the alphabet of objects; 

2. $C \subseteq O$ is the set of catalysts; 

3. $T \subseteq (O - C)$ is the set of terminal objects; 

4. $R$ is a finite set of rules of the form $u \to v$, where $u \in O^+$, $v \in O^*$; the 
catalytic rules are of the form $ca \to cu$ for $c \in C$, $a \in O - C$, and $u \in (O - C)^*$; 

5. $\mu$ is a membrane structure of degree $m$, with the membranes labeled in a 
one to one manner with $1, 2, \ldots, m$; 

6. $w_1, \ldots, w_m$ are strings over $O$ specifying the multisets of objects present in 
the $m$ regions of $\mu$ at the beginning of the computation; 

7. $R_1, \ldots, R_m$ are subsets of $R$, specifying the rules available at the beginning 
of the computation in the $m$ regions of $\mu$; 

8. $D_1, \ldots, D_m$ are sets of pairs - we call them metarules - of the form $(r, tar)$, 
with $r \in R$ and $tar \in \{here, out, in\}$, associated with the $m$ regions of $\mu$; 

9. $1 \le i_o \le m$ is the output membrane of the system, an elementary one in $\mu$. 

Note that the objects from regions are present in the multiset sense (their 
multiplicity matters), but the rules are present in the set sense - if present, a 
rule is present in principle, as a possibility to be used. For instance, if at some 
step of a computation we have a rule r in a region i and the same rule r is sent 
to i from an inner or outer region, then we will continue to have the rule r 
present, not two or more copies of it. 

However, if a rule r is present in a region i, then it is applied in the nondeter- 
ministic maximally parallel manner as usual in P systems: we nondeterministi- 
cally assign the available objects to the available rules, such that the assignment 
is exhaustive, no further object can evolve by means of the existing rules. Objects 
have no target indication in the rules of R, hence all objects obtained by apply- 
ing rules remain in the same region. However, the applied rules move according 
to the pairs (r, tar) from Di. 

Specifically, we assume that we apply pairs (r, tar), not simply rules r. For in- 
stance, if in a region i we have two copies of a and two metarules (r, here), (r, out) 
for some $r : a \to b$ which is present in region i, then the two copies of a can 
evolve both by means of (r, here), or one by (r, here) and one by (r, out), or both 
by means of (r, out). In the first case, the rule r remains in the same region, in 
the second case the rule will be available both in the same region and outside 
it, while in the third case the rule r will be available only in the region outside 
membrane i (if this is the environment, then the rule might be “lost”, because 
we cannot take it back from the environment). Anyway, both copies of a should 
evolve (the maximality of the parallelism). If r is not present, then a remains un- 
changed; if a is not present, then the rule is not used, hence it remains available 
for the next step. 

Thus, a rule can be used an arbitrarily large number of times in a region, 
and for each different target tar from pairs (r, tar) used, a different region of the 
system will have the rule available in the next step. 
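
The three cases of the example with two copies of a and the metarules (r, here), (r, out) can be enumerated mechanically. The following Python sketch is only an illustration under stated assumptions: the function name `rule_locations` and the encoding of targets as strings are hypothetical choices made here, not part of the definition above.

```python
from itertools import combinations_with_replacement

def rule_locations(copies, targets):
    """Enumerate the possible outcomes of one maximally parallel step.

    Every one of the `copies` objects is rewritten through some metarule
    (r, tar). Rules are present in the set sense, so after the step the
    rule r is available once in each region named by a target actually used.
    """
    outcomes = []
    # The objects are indistinguishable, so only the multiset of chosen
    # targets matters, not which copy used which metarule.
    for choice in combinations_with_replacement(sorted(targets), copies):
        outcomes.append((choice, frozenset(choice)))
    return outcomes

# Two copies of a, metarules (r, here) and (r, out), as in the text.
for choice, regions in rule_locations(2, {"here", "out"}):
    print(choice, "-> rule r available at:", sorted(regions))
```

The loop lists the case where both copies use (r, here), the mixed case in which the rule becomes available both in the region and outside it, and the case where both copies use (r, out).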

The sets $D_i$, $1 \le i \le m$, are not necessarily “$R$-complete”, that is, not each 
rule $r \in R$ has a metarule $(r, tar)$ in each region; if a rule $r$ arrives in a region $i$ 
for which no metarule $(r, tar)$ is provided by $D_i$, then the rule cannot be used, it 
is ignored from that step on (in region $i$). Thus, in each set $D_i$ we can have none, 
one, two, or three different pairs $(r, tar)$ for each $r$, with different $tar \in \{here, 
out, in\}$. 

The computation starts from the initial configuration of the system, that 
described by $w_1, \ldots, w_m, R_1, \ldots, R_m$, and evolves according to the metarules 
from $D_1, \ldots, D_m$. With a halting computation we associate a result, in the form 
of the vector $\Psi_T(w)$ describing the multiplicity of elements from $T$ present in the 
output membrane $i_o$ in the halting configuration (thus, $w \in T^*$, the objects from 
$O - T$ are ignored). We denote by $Ps(\Pi)$ the set of vectors of natural numbers 
computed in this way by $\Pi$. 

As usual, the rules from $R$ can be cooperative, catalytic, or noncooperative. 
We denote by $PsRMP_m(\alpha)$ the family of sets of vectors $Ps(\Pi)$ computed as 
above by P systems with rule migration using at most $m \ge 1$ membranes, with 
rules of type $\alpha \in \{coo, cat, ncoo\}$. 

The systems from the proof of the next theorem illustrate the previous definition 
- and also shed some light on the power of P systems with migrating rules. 
We conjecture (actually, we mainly hope) that these systems are not universal. 
(In the theorem below, PsMAT is the family of Parikh images of languages 
generated by matrix grammars without appearance checking.) 



Theorem 1. $PsMAT \subset PsRMP_2(cat)$. 

Proof. Let us consider a matrix grammar without appearance checking $G = 
(N, T, S, M)$ in the binary normal form, that is, with $N = N_1 \cup N_2 \cup \{S\}$ and 
the matrices from $M$ of the forms (1) $(S \to XA)$, $X \in N_1$, $A \in N_2$, (2) $(X \to 
Y, A \to x)$, $X, Y \in N_1$, $A \in N_2$, $x \in (N_2 \cup T)^*$, and (3) $(X \to \lambda, A \to x)$, $X \in 
N_1$, $A \in N_2$, $x \in T^*$; the sets $N_1, N_2, \{S\}$ are mutually disjoint and there is only 
one matrix of type (1). We assume all matrices of types (2) and (3) from $M$ 
labeled in a one to one manner with elements of a set $H$. Without any loss of 
generality we may assume that for each symbol $X \in N_1$ there is at least one 
matrix of the form $(X \to \alpha, A \to x)$ in $M$ (if a symbol $X$ does not appear in such 
a matrix, then it can be removed, together with all matrices which introduce it 
by means of rules of the form $Z \to X$). 

We construct the P system with migrating rules 



$$\Pi = (O, \{c\}, T, R, [_1 [_2\, ]_2 ]_1, w_1, w_2, R_1, R_2, D_1, D_2, 1),$$

where: 



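$$O = N_1 \cup N_2 \cup T \cup \{B' \mid B \in N_2\} \cup \{h, h', h'', h''' \mid h \in H\} \cup \{c, \#\},$$

$$\begin{aligned}
R ={} & \{\, r_{1,h} : X \to h, \\
      & \;\; r_{2,h} : h \to h', \\
      & \;\; r_{3,h} : h' \to h'', \\
      & \;\; r_{4,h} : h'' \to h''', \\
      & \;\; r_{5,h} : ch''' \to c\#, \\
      & \;\; r_{6,h} : h''' \to \alpha, \\
      & \;\; r_{7,h} : h \to Ah'', \\
      & \;\; r_{8,h} : cA \to cx' \mid \text{for all } h : (X \to \alpha, A \to x) \in M, \\
      & \;\; X \in N_1,\ \alpha \in N_1 \cup \{\lambda\},\ A \in N_2,\ x \in (N_2 \cup T)^* \,\} \\
      & \cup \{\, r_B : B' \to B,\ r'_B : B \to B \mid \text{for all } B \in N_2 \,\} \\
      & \cup \{\, r_X : X \to XX,\ r'_X : X \to \lambda \mid \text{for all } X \in N_1 \,\} \\
      & \cup \{\, r_\infty : \# \to \# \,\},
\end{aligned}$$

where $x'$ is the string obtained by priming all symbols 
from $N_2$ which appear in $x \in (N_2 \cup T)^*$ and 
leaving unchanged the symbols from $T$, 

$$w_1 = cX_1 \ldots X_s, \text{ for } N_1 = \{X_1, X_2, \ldots, X_s\},\ s \ge 1,$$

$$w_2 = cXA, \text{ for } (S \to XA) \text{ being the initial matrix of } G,$$

$$R_1 = \{r_{4,h}, r_{6,h}, r_{7,h}, r_{8,h} \mid h \in H\} \cup \{r_X, r'_X \mid X \in N_1\},$$

$$R_2 = \{r_{1,h}, r_{2,h}, r_{3,h}, r_{4,h}, r_{5,h} \mid h \in H\} \cup \{r_B, r'_B \mid B \in N_2\} \cup \{r_\infty\},$$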

and with the following sets of metarules: 

$D_1$ contains the pair $(r, here)$ for all rules $r \in R$ with the exception of the rules 
from the next pairs 

$(r_{1,h}, out)$, $(r_{6,h}, out)$, $(r_{8,h}, out)$, for all $h \in H$, which are in $D_1$, too; 

$D_2$ contains the pair $(r, here)$ for all rules $r \in R$ with the exception of the rules 
from the next pairs 

$(r_{1,h}, in)$, $(r_{6,h}, in)$, $(r_{8,h}, in)$, for all $h \in H$, which are in $D_2$, too. 

Note that the metarules are “deterministic”, in the sense that for each $r \in R$ 
there is only one pair $(r, tar)$ in each of $D_1$ and $D_2$, and that the only migrating 
rules are $r_{1,h}, r_{6,h}, r_{8,h}$, for each $h \in H$. Initially, $r_{1,h}$ is in region 2 and $r_{6,h}, r_{8,h}$ 
are in region 1. 

For the reader's convenience, the initial configuration of the system, including 
the metarules different from $(r, here)$ from the two regions, is pictorially 
represented in Figure 1. 




Figure 1: The P system with migrating rules from Theorem 1. 

The skin region contains the rules $X \to XX$ and $X \to \lambda$ for each $X \in N_1$; 
they can produce any necessary copy of symbols from $N_1$ and can also remove 
all such symbols at the end of a computation. The inner region contains a rule 
$B \to B$ for each $B \in N_2$, hence the computation will continue as long as any 
symbol from $N_2$ is present. 

Assume that we start with a multiset $cXw$, $X \in N_1$, $w \in (N_2 \cup T)^*$, in region 
2 - initially, we have here $cXA$, for $XA$ introduced by the initial matrix of $G$. 
The only possibility (besides rules $r'_B : B \to B$ which we will ignore from now 
on) is to use a rule of the form $r_{1,h}$ for some matrix $h : (X \to \alpha, A \to x)$. We 
obtain the object $h$ in region 2, and the rule $r_{1,h}$ goes to region 1. 

Assume that in region 1 we have available a copy of $X$, moreover, we use it 
for applying the rule $r_{1,h}$; thus, the rule $r_{1,h}$ returns to the inner region. At the 
same time, in region 2 we use the rule $r_{2,h} : h \to h'$. Now we have $h$ in region 
1 and $h'$ in region 2; no nonterminal from $N_1$ is available in region 2, so no rule 
of type $r_{1,g}$, $g \in H$, can be used. Simultaneously, we use now $r_{3,h} : h' \to h''$ in 
region 2 and $r_{7,h} : h \to Ah''$ in region 1. Both rules remain in the respective 
regions. 

We continue with $r_{4,h} : h'' \to h'''$ in region 2, and with both $r_{4,h} : h'' \to h'''$ 
and $r_{8,h} : cA \to cx'$ in region 1. The first two rules remain in the respective 
regions, the last one goes to region 2. In region 2 we have available both rules 
$r_{5,h} : ch''' \to c\#$ and $r_{8,h} : cA \to cx'$ (but not also $r_{6,h} : h''' \to \alpha$, which is still 
in region 1). If the rule $r_{5,h} : ch''' \to c\#$ is used, then the computation will never 
finish. Thus, we have to use the rule $r_{8,h} : cA \to cx'$, which simulates the second 
rule of the matrix $h$, with the nonterminals of $x$ primed. In parallel, in region 1 
we have to use the rule $r_{6,h} : h''' \to \alpha$, which has then to enter region 2. In this 
way, at the next step in region 2 we can avoid using the rule $r_{5,h} : ch''' \to c\#$ 
and use the rule $r_{6,h} : h''' \to \alpha$, which completes the simulation of the matrix $h$ 
(in parallel, the rules $r_B$ - always available - return $x'$ to $x$). 

If $X$ is not present in region 1, or only one copy of $X$ is present and we do not 
apply to it the rule $X \to h$, then the rule $r_{7,h}$ is not used in the next step, hence 
neither are $r_{8,h}$ and $r_{6,h}$ after that, and then the use of the rule $r_{5,h} : ch''' \to c\#$ 
cannot be avoided, and $\#$ is introduced in membrane 2, making the computation 
never halt. 

In the case when $\alpha \in N_1$, the process can be iterated: after simulating any 
matrix, the rules $r_{1,h}, r_{6,h}, r_{8,h}$ are back in the regions where they were placed 
in the beginning; note that the symbols from $N_2$ present in region 1 are primed, 
hence the rules $r_{8,h}$ cannot use them, while the possible symbol $Y$ introduced in 
region 1 is just one further copy of such symbols already present here. 

In the case when $\alpha = \lambda$, hence $h$ was a terminal matrix, the computation 
stops if no symbol $B \in N_2$ is present in region 2 (hence the derivation in $G$ 
is terminal), or continues forever otherwise. Consequently, $Ps(\Pi) = \Psi_T(L(G))$, 
which shows that we have the inclusion $PsMAT \subseteq PsRMP_2(cat)$. 

This is a proper inclusion. In fact, we have the stronger assertion 
$PsRMP_1(ncoo) - PsMAT \ne \emptyset$, which is proved by the fact that one-membrane 
P systems can compute one-letter non-semilinear sets of numbers (such sets 
cannot be Parikh sets of one-letter matrix languages). This is the case with the 
following system 

$$\Pi = (\{a\}, \emptyset, \{a\}, \{h : a \to aa\}, [_1\, ]_1, a, \{h\}, \{(h, here), (h, out)\}, 1).$$

In each step we double the number of copies of $a$ existing in the system; if we 
use at least once the metarule $(h, here)$, then the computation continues, otherwise 
the rule is “lost” in the environment and the computation halts. Clearly, 
$Ps(\Pi) = \{(2^n) \mid n \ge 1\}$, which is not in $PsMAT$. □ 
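
The behavior of this one-membrane system can be sketched in a few lines of Python. This is a simplified illustration, not a general P system simulator: the function `run_doubling` and its `keep_rule` policy argument are hypothetical names introduced here, and the nondeterministic choice of metarules is replaced by an explicit policy so that runs are reproducible.

```python
def run_doubling(keep_rule, max_steps=50):
    """Simulate the system with the single rule h: a -> aa in one membrane.

    keep_rule(step) is an assumed policy standing in for nondeterminism:
    it returns True if at least one copy of a evolves through the metarule
    (h, here) at that step, and False if every copy uses (h, out), which
    sends h to the environment for good.
    """
    copies = 1           # initial multiset: a single object a
    rule_present = True  # h is in region 1 at the beginning
    step = 0
    while rule_present and step < max_steps:
        step += 1
        copies *= 2                    # maximal parallelism: every a is rewritten
        rule_present = keep_rule(step)
    halted = not rule_present          # without h no object can evolve
    return copies, halted

# Keep the rule for three steps and lose it at the fourth.
print(run_doubling(lambda step: step < 4))
```

With the policy shown, the rule is lost at the fourth step, so the run halts with $2^4$ copies of a; since at least one step is always performed, halting runs produce exactly the powers $2^n$ with $n \ge 1$.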

We do not know whether the result in Theorem 1 can be improved, by com- 
puting all Parikh images of recursively enumerable languages, or - especially 
interesting - by using only non-cooperative rules. 

The system constructed at the beginning of the previous proof has never 
replicated a rule, because in each region each rule had only one target associ- 
ated by the local metarules. In the case when the metarules specify different 
targets for the same rule, we can either proceed as in the definition given above, 
or we can restrict the use of metarules so that only one target is used. On the 
other hand, we can complicate the matter by adding further features, such as the 
possibility to dissolve a membrane (then both the objects and the rules should 
remain free in the immediately upper membrane), or the possibility to also move 
objects through membranes. Furthermore, we can take as the result of a compu- 
tation the trace of a designated rule in its passage through membranes, in the 
same way as the trace of a traveller-object was considered in “standard” (symport/antiport) 
P systems. Then, we can pass to systems with symport/antiport 
rules; in such a case, out will mean that the rule moves up and gets associ- 
ated with the membrane immediately above the membrane with which the rule 
was associated (and applied), while in will mean going one step down in the 
membrane structure. Otherwise stated, the targets are, in fact, here, up, down. 

Plenty of problems, but we switch to another “exotic” idea. 

3 Counter-Dividing Membranes 

The problem to consider reversible P systems was formulated several times - 
with “reversible” meaning both the possibility to reverse computations, like in 
dynamic systems, and local/strong reversibility, at the level of rules: considering 
$v \to u$ at the same time with $u \to v$. For multiset rewriting-like rules this does 
not seem very spectacular (although most proofs from the membrane computing 
literature, if not all proofs, will get ruined by imposing this restriction to the 
existing sets of rules, at least because the synchronization is lost). The case 
of membrane division rules looks different/strange. A rule $[_i\, a]_i \to [_i\, b]_i [_i\, c]_i$ 
produces two identical copies of the membrane $i$, replicating the contents of the 
divided membrane, the only difference between the two copies being that objects 
$b$ and $c$ replace the former object $a$. Moreover, generalizations can be considered, 
with division into more than two new membranes, with new membranes having 
(the same contents but) different labels. Also, we can divide both elementary 
and non-elementary membranes. 

Let us stay at the simplest level, of rules $[_i\, a]_i \to [_i\, b]_i [_i\, c]_i$, which deal with 
elementary membranes only, 2-division, and the same label for the new membranes. 
Reversing such a rule, hence considering a rule $[_i\, b]_i [_i\, c]_i \to [_i\, a]_i$, will 
mean to pass from two membranes with the label $i$ and identical contents up to 
the two objects $b$ and $c$ present in them to a single membrane, with the same 
contents and with the objects $b, c$ replaced by a new object, $a$. This operation 
looks rather powerful, as in only one step it compares the contents of two membranes, 
irrespective of how large they are, and removes one of them, also changing 
one object. 
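
The effect of a counter-division rule on the contents of two membranes can be sketched as follows. This is a minimal illustration, assuming membrane contents are represented as Python `Counter` multisets; the function name `counter_divide` and the convention that the first argument is the membrane holding $b$ are choices made here, not part of the model.

```python
from collections import Counter

def counter_divide(m1, m2, b, c, a):
    """Try to apply [b][c] -> [a] to the contents of two membranes.

    m1 and m2 are multisets of objects. The rule is applicable when m1
    contains b, m2 contains c, and the two contents coincide once one b
    is removed from m1 and one c from m2. Returns the contents of the
    single remaining membrane, or None if the rule is not applicable.
    """
    if m1[b] == 0 or m2[c] == 0:
        return None
    rest1 = m1 - Counter({b: 1})
    rest2 = m2 - Counter({c: 1})
    if rest1 != rest2:
        return None
    return rest1 + Counter({a: 1})

left = Counter({"b": 1, "x": 2, "y": 1})
right = Counter({"c": 1, "x": 2, "y": 1})
print(counter_divide(left, right, "b", "c", "a"))
```

The comparison of the two multisets is what the model performs in a single step; in this sketch it costs time proportional to the number of distinct objects, which is exactly the work the rule is assumed to do for free.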

How to use this presumed power, for instance, for obtaining solutions to 
computationally hard problems, remains to be investigated. Here we are only 
mentioning the fact that this counter-division operation can be used in order to 
build a system which computes the intersection of the sets of numbers/vectors 
computed by two given systems. The idea is suggested in Figure 2. Consider two 
P systems, $\Pi_1$ and $\Pi_2$. Associate with each of them one membrane with label 
$i$. Embed one of the systems, say $\Pi_2$, together with the associated membrane 
$i$, in a further membrane, with label 2, and then all these membranes into a 
skin membrane (1, in Figure 2). Let the systems $\Pi_1, \Pi_2$ work and halt; the 
objects defining the results are moved out of the two systems and then into 
the corresponding membranes with label $i$. At some (suitable) moment, dissolve 
membrane 2. Now, by a rule $[_i\, b]_i [_i\, c]_i \to [_i\, a]_i$ check whether or not the 
two systems have produced the same number (or vector of numbers), and, in 
the affirmative case, we stop. (For instance, each membrane i can evolve forever 
by means of suitable rules applied to b and c, respectively; after removing one 
of the membranes, the object a newly introduced will not evolve, hence the 
computation can stop.) Of course, several details are to be completed, depending 
on the type of the systems we work with. 




Now, again the question arises: is this way of computing the intersection of any 
use? We avoid addressing this issue, and pass to the third new type of P 
systems. 

4 Rules (and Membranes) Acceleration 

A new type of P systems, but not a new idea: accelerated Turing machines have 
been considered for many years, see [1], [2], [3] and their references. Assume that 
our machine is so clever that it can learn from its own functioning, and this 
happens at the spectacular level of always halving the time needed to perform a 
step; the first step takes, as usual, one time unit. Then, in 
$1 + \frac{1}{2} + \frac{1}{4} + \ldots + \frac{1}{2^n} + \ldots = 2$ 
time units the machine performs an infinity of steps, hence finishing the 
computation... Beautiful, but exotic... 
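
The value 2 is the sum of a geometric series. Spelled out, with $t_k$ (a symbol introduced here only for this illustration) denoting the duration of the $k$th step:

```latex
t_k = \frac{1}{2^{\,k-1}}, \qquad
\sum_{k=1}^{n} t_k = 2 - \frac{1}{2^{\,n-1}}, \qquad
\sum_{k=1}^{\infty} t_k
  = \lim_{n\to\infty}\Bigl(2 - \frac{1}{2^{\,n-1}}\Bigr) = 2 .
```

Any finite number of steps therefore finishes strictly before time 2, and the bound 2 is reached only in the limit of infinitely many steps.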

Let us consider P systems using similarly clever rules, each one learning 
from its own previous applications and halving the time needed at the next 
application. Because at the same time we can have several rules used in parallel, 
in the same region or in separate regions, a problem appears with the cooperation 
of rules, with the objects some rules are producing for other rules to use. As it 
is natural to assume, such objects are available only after completing the use of 
a rule. Because the rules have now different “speeds” of application, we have to 
take care of the times when they can take objects produced by other rules. 

How long does a computation last, in terms of time units rather than of 
computational steps (of used rules)? For Turing machines, two time units are sufficient for 
carrying out any computation. In our case, each rule can take two time units, 
so a system with n rules, not necessarily used in parallel, will compute for at 
most 2n time units. Thus, if we were able to find accelerated P systems able to 
compute all Turing computable sets of numbers/vectors, then an upper bound 
can be obtained on the time needed for all these computations, by considering 
a universal P system (hence with a given number of rules). The only problem 
is to find P systems (of various types) able to compute all computable sets of 
numbers/vectors in the accelerated mode. 
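
The bound of 2n time units can be checked numerically for rules that are applied one after another. The sketch below is an illustration under the assumption that no two rules are ever used in parallel; the function name `accelerated_time` and the dictionary encoding of application counts are hypothetical.

```python
from fractions import Fraction

def accelerated_time(applications):
    """External time consumed by accelerated rules used sequentially.

    applications maps a rule name to the number of times it is applied.
    The k-th application of a rule costs 1 / 2**(k - 1) time units, so
    k applications of one rule cost 2 - 1 / 2**(k - 1) units in total,
    which is always less than 2.
    """
    total = Fraction(0)
    for count in applications.values():
        if count > 0:
            total += 2 - Fraction(1, 2 ** (count - 1))
    return total

usage = {"r1": 1, "r2": 3, "r3": 40}
print(accelerated_time(usage), "<", 2 * len(usage))
```

Exact fractions are used instead of floating point so that the strict inequality with 2n is not blurred by rounding when a rule is applied many times.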

It is relatively easy to find such systems - the trick is to ensure that no two 
rules are used in parallel, and then no synchronization problem appears. 

Consider first the case of symbol-object P systems with multiset rewriting-like 
rules of a cooperative type. It is easy to see that the P system constructed 
in the proof of Theorem 3.3.3 from [4] computes the same set of numbers with 
or without having accelerated rules (with the exception of rules dealing with the 
trap symbol #, which anyway make the computation never stop, all other 
rules are not used in parallel). 

The same assertion is obtained also for P systems with symport/antiport 
rules - specifically, for systems with antiport rules of weight 2 and no symport 
rule. Because of the technical interest in the proof of this assertion, we give it in 
some detail. Let $N(\Pi)$ be the set of numbers computed by a P system $\Pi$ and 
let $NOP_m(acc, sym_r, anti_s)$ be the family of sets $N(\Pi)$ computed by systems 
with at most $m$ membranes, using accelerated symport rules of weight at most $r$ 
and antiport rules of weight at most $s$. By $NRE$ we denote the family of Turing 
computable sets of natural numbers (the length sets of recursively enumerable 
languages). 

Theorem 2. $NRE \subseteq NOP_1(acc, sym_0, anti_2)$. 

Proof. Consider a matrix grammar with appearance checking $G = 
(N, T, S, M, F)$ in the Z-binary normal form. Therefore, we have $N = N_1 \cup 
N_2 \cup \{S, Z, \#\}$, with these three sets mutually disjoint, and the matrices in $M$ 
are in one of the following forms: 

1. $(S \to XA)$, with $X \in N_1$, $A \in N_2$, 

2. $(X \to Y, A \to x)$, with $X, Y \in N_1$, $A \in N_2$, $x \in (N_2 \cup T)^*$, $|x| \le 2$, 

3. $(X \to Y, A \to \#)$, with $X \in N_1$, $Y \in N_1 \cup \{Z\}$, $A \in N_2$, 

4. $(Z \to \lambda)$. 

Moreover, there is only one matrix of type 1, $F$ consists exactly of all rules 
$A \to \#$ appearing in matrices of type 3, and if a sentential form generated by 
$G$ contains the object $Z$, then it is of the form $Zw$, for some $w \in (T \cup \{\#\})^*$ 
(that is, the appearance of $Z$ makes sure that, except for $Z$ and, possibly, $\#$, all 
objects which are present in the system are terminal); $\#$ is a trap-object, and 
the (unique) matrix of type 4 is used only once, in the last step of a derivation. 

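We construct the following P system with accelerated antiport rules: 

$$\Pi = (O, T, [_1\, ]_1, XA, O, R_1, 1),$$

$$\begin{aligned}
O ={} & N_1 \cup N_2 \cup T \cup \{A', Y_A, Y'_A \mid Y \in N_1 \cup \{Z\},\ A \in N_2\} \\
      & \cup \{\langle Y\alpha \rangle \mid Y \in N_1,\ \alpha \in N_2 \cup T \cup \{\lambda\}\} \cup \{Z, \#\},
\end{aligned}$$

$$\begin{aligned}
R_1 ={} & \{(XA, out; \langle Y\alpha_1 \rangle \alpha_2, in), \\
        & \;\; (\langle Y\alpha_1 \rangle, out; Y\alpha_1, in) \mid \text{for } (X \to Y, A \to \alpha_1\alpha_2) \in M, \\
        & \;\; \text{with } X, Y \in N_1,\ A \in N_2,\ \alpha_1, \alpha_2 \in N_2 \cup T \cup \{\lambda\}\} \\
        & \cup \{(X, out; Y_A A', in), \\
        & \;\; (A'A, out; \#, in), \\
        & \;\; (Y_A, out; Y'_A, in), \\
        & \;\; (Y'_A A', out; Y, in) \mid \text{for } (X \to Y, A \to \#) \in M, \\
        & \;\; \text{with } X \in N_1,\ Y \in N_1 \cup \{Z\},\ A \in N_2\} \\
        & \cup \{(\#, out; \#, in)\} \\
        & \cup \{(X, out; X, in) \mid X \in N_1\}.
\end{aligned}$$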

The equality $N(\Pi) = \{|w| \mid w \in L(G)\}$ can easily be checked, even for 
the accelerated way of working. Indeed, no two rules are ever used in parallel, 
hence the acceleration makes no trouble. Now, the matrices $(X \to Y, A \to \alpha_1\alpha_2)$ 
without appearance checking rules can be directly simulated by means of rules 
$(XA, out; \langle Y\alpha_1 \rangle \alpha_2, in)$, $(\langle Y\alpha_1 \rangle, out; Y\alpha_1, in)$. A matrix $(X \to Y, A \to \#)$ is 
simulated as follows. First, we use $(X, out; Y_A A', in)$, hence $X$ is sent to the 
environment and $Y_A, A'$ enter the system (note that all objects are available in the 
environment in arbitrarily many copies). If any copy of $A$ is present in the 
system, then the rule $(A'A, out; \#, in)$ must be used and the computation will never 
halt. If $A$ is not present, then the object $A'$ waits in the system for the object $Y'_A$ 
which is introduced by the rule $(Y_A, out; Y'_A, in)$. By the rule $(Y'_A A', out; Y, in)$ 
we then send out the auxiliary objects $Y'_A, A'$ and bring in the object $Y$, thus 
completing the simulation of the matrix. As long as any symbol $X \in N_1$ is 
present, the computation continues, at least by the rule $(X, out; X, in)$. Because 
we always have at most one object $X \in N_1$ in the system, no synchronization 
problem appears. When the symbol $Z$ is introduced, this means that the derivation 
in $G$ is concluded. No rule processes the object $Z$ in the system $\Pi$. If the 
object $\#$ is present, then the computation will continue forever, otherwise also 
the computation in $\Pi$ stops. Thus, we stop if and only if the obtained multiset 
- minus the object $Z$ - corresponds to a terminal string generated by $G$, and 
this concludes the proof. □ 

Several research topics remain to be investigated also in this case. One of 
them was already mentioned: find a constant $U$ such that each set in $NRE$ 
can be computed in at most $U$ time units by a P system of a given type (for 
instance, corresponding to the family $NOP_1(acc, sym_0, anti_2)$). To this aim, it is 
first necessary to find a universal matrix grammar with appearance checking in 
the Z-normal form (or to start the proof from other universal devices equivalent 
with Turing machines). 

Then, it is natural to consider P systems where certain components are accel- 
erated. In two time units, such a component finishes its work, possibly producing 
an element from a set from NRE. Can this be used by the other membranes of 
the system in order to speed-up computations or even to go beyond Turing, com- 
puting sets of numbers which are not Turing computable? This speculation has 
been made from time to time with respect to bio-computing models, membrane 
systems included, but no biologically inspired idea was reported able to reach 
such a goal. The acceleration of rules and/or of membranes can be a solution - 
though not necessarily biologically inspired. 

Indeed, accelerated Turing machines can solve Turing undecidable problems, 
for instance, the halting problem, in the following sense. Consider a Turing ma- 
chine M and an input w for it. Construct an accelerated Turing machine M' , 
obtained by accelerating M and also providing M' with a way to output a signal 
s if and only if M halts when starting with w on its tape. Letting M' work, in 
two time units we have the answer to the question whether or not M halts on w: 
if M halts, then also M' halts and provides the signal s; if M does not halt, then 
neither M' halts, but its computation, though infinite, lasts only two time units. 
So, M halts on w if and only if M' outputs s after two time units. Again, it is 
important to note the difference between the external time (measured in “time 
units”), and the internal duration of a computation, measured by the number 
of rules used by the machine. Also important is the fact that M' is not a proper 
Turing machine, because “providing a signal s” is not an instruction in a Turing 
machine. (This way to get an information about a computation reminds of the 
way of solving decision problems by P systems, by sending into the environment 
a special object yes. The difference is that sending objects outside the system 
is a “standard instruction” in P systems.) Further discussions about accelerated 
Turing machines and their relation with “so-called Turing-Church thesis” can 
be found, e.g., in [2]. 
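
The difference between external time and internal steps can be made concrete
with a small numerical sketch. It assumes the usual acceleration schedule, which
is not restated in this passage: the first step takes one time unit and each later
step takes half the time of the previous one. The name `external_time` is
illustrative only.

```python
from fractions import Fraction


def external_time(steps: int) -> Fraction:
    """External time used by the first `steps` internal steps.

    Assumption: step i (counting from 0) lasts 2**-i time units.
    """
    return sum((Fraction(1, 2**i) for i in range(steps)), Fraction(0))


# The external time approaches, but never reaches, two time units,
# no matter how many internal steps are performed.
for n in (1, 2, 10, 50):
    assert external_time(n) == 2 - Fraction(1, 2 ** (n - 1))
    assert external_time(n) < 2
```

Under this assumption an infinite computation fits into two external time units,
which is why the absence of the signal s after two units can be read as "M does
not halt".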

It is now natural to use acceleration for constructing P systems which can 
compute Turing non-computable sets of numbers (or functions, or languages,
etc.). In the string-objects case with cooperative rules this task is trivial: just
take a Turing machine and simulate it by a P system; an accelerated Turing
machine capable of non-Turing computations will lead to a P system with the same
property.

The symbol-object case does not look similarly simple. On the one hand, we
do not know of such a thing as an accelerated deterministic register machine (or
matrix grammar with appearance checking). On the other hand, we have to make
sure that the simulation of such a machine by means of a P system of a given
type faithfully corresponds to the functioning of the machine: the system has to
stop if and only if the machine stops. This is not the case, for instance, with the
system from the proof of Theorem 2, because of the intrinsically nondeterministic
behavior of $\Pi$: the trap symbol # can be introduced because of the attempt to
simulate a “wrong” matrix; the derivation in G could correctly continue and
eventually correctly halt, but the system fails to simulate it. It seems to be a
challenging task to find an accelerated symport/antiport system which can avoid
this difficulty (and hence can compute Turing non-computable sets of numbers).

Actually, [3] discusses ten possibilities of changing a Turing machine
so as to obtain “hypercomputations” (computations which cannot be
carried out by usual Turing machines); we do not recall these possibilities here,
but only point them out to the reader interested in “going beyond Turing”.



5 Reliable P Systems 

As suggested in the Introduction, the intention behind the definition of reliable 
P systems is to make use of the following observation. Assume that we have some 
copies of an object a and two rules, $a \to b$ and $a \to c$, in the same region. Then,
some copies of a will evolve by means of the first rule and some others by means
of the second rule. All combinations are possible, from “all copies of a go to b”
to “all copies of a go to c”. In biology and chemistry the range of possibilities is
not so large: if we have “enough” copies of a, then “for sure” part of them will 
become b and “for sure” the other part will become c. What “enough” can mean 
depends on the circumstances. At our symbolic level, if we have two copies of 
a we cannot expect that “for sure” one will become b and one c, but starting 
from, say, some dozens of copies can ensure that both rules are applied. 

We formulate this observation for string rewriting. Assume that a string w
can be rewritten by n rules (each of them can be applied to w). The system we
work with is reliable if all the n rules are used as soon as we have at least $n \cdot n^k$
copies of w, for a given constant k depending on the system. If we work with
parallel rewriting and n derivations $w \Rightarrow w_i$, $1 \le i \le n$, are possible,
then each one happens as soon as we have at least $n \cdot n^k$ copies of w.

We consider here polynomially many opportunities for each of the n alterna- 
tives, but other functions than polynomials can be considered; constant functions 
would correspond to a “very optimistic” approach, while exponential functions 
would indicate a much less optimistic approach. 

Instead of elaborating more at the general level (although many topics arise
here: give a fuzzy-set, rough-set, or probabilistic definition of reliability; provide
some sufficient conditions for it, a sort of axioms entailing reliability; investigate
its usefulness at the theoretical level and its adequacy/limits in practical
applications/experiments), we pass directly to showing a way to use the idea of
reliability in solving SAT in linear time.

Consider a propositional formula C in conjunctive normal form, consisting
of m clauses $C_j$, $1 \le j \le m$, involving n variables $x_i$, $1 \le i \le n$. We
construct the following P system (with string-objects, and both replicated and
parallel rewriting, but with the rules which replicate strings always applicable to
strings of length one only) of degree m:

\[
\begin{aligned}
\Pi &= (V, \mu, b_0, \lambda, \ldots, \lambda, R_1, \ldots, R_m),\\
V &= \{b_i \mid 0 \le i \le r\} \cup \{a_i, t_i, f_i \mid 1 \le i \le n\},\\
\mu &= [_1 [_2 \ldots [_{m-1} [_m\, ]_m ]_{m-1} \ldots ]_2 ]_1,\\
R_m &= \{b_i \to b_{i+1} \| b_{i+1} \mid 0 \le i \le r - 1\}\\
&\quad \cup \{b_r \to a_1 a_2 \ldots a_n\}\\
&\quad \cup \{a_i \to t_i,\ a_i \to f_i \mid 1 \le i \le n\}\\
&\quad \cup \{(t_i \to t_i, out) \mid x_i \text{ appears in } C_m,\ 1 \le i \le n\}\\
&\quad \cup \{(f_i \to f_i, out) \mid \neg x_i \text{ appears in } C_m,\ 1 \le i \le n\},\\
R_j &= \{(t_i \to t_i, out) \mid x_i \text{ appears in } C_j,\ 1 \le i \le n\}\\
&\quad \cup \{(f_i \to f_i, out) \mid \neg x_i \text{ appears in } C_j,\ 1 \le i \le n\},
\end{aligned}
\]
for all j = 1, 2, . . . , m - 1.

The parameter r from this construction depends on the polynomial which
ensures the reliability of the system; see immediately below what this means.

The work of the system starts in the central membrane, the one with the
label m. In the first r steps, by means of replicating rules of the form
$b_i \to b_{i+1} \| b_{i+1}$, we generate $2^r$ strings $b_r$ (of length one; this phase can be
considered as using symbol objects, and then the rules are usual multiset rewriting
rules, $b_i \to b_{i+1}^2$). In the next step, each $b_r$ is replaced by the string $a_1 a_2 \ldots a_n$. No
parallel rewriting was used up to now, but in the next step each $a_i$ is replaced
either by $t_i$ or by $f_i$. We have $2^n$ possibilities of combining the rules $a_i \to t_i$,
$a_i \to f_i$, $1 \le i \le n$. If $2^r$ is “large enough” with respect to $2^n$, then we may assume
that all the $2^n$ combinations really happen. This is the crucial reliability-based
step. For each of the $2^n$ possibilities we need $(2^n)^k$ opportunities to happen.
This means that we need to have a large enough r such that $2^r \ge 2^n \cdot (2^n)^k$,
for a constant k (depending on the form of rules, hence on SAT). This means
$r \ge n(k + 1)$; hence this is the number of steps we need to perform before
introducing the strings $a_1 a_2 \ldots a_n$ in order to ensure that we have “enough”
copies of these strings.
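
The bound on r can be checked numerically. The sketch below searches for the
smallest r satisfying the inequality and compares it with n(k + 1); the function
name is hypothetical and the values of n and k are arbitrary samples.

```python
def min_replication_steps(n: int, k: int) -> int:
    """Smallest r such that 2**r >= 2**n * (2**n)**k."""
    needed = 2**n * (2**n) ** k  # copies required by the reliability assumption
    r = 0
    while 2**r < needed:
        r += 1
    return r


for n in range(1, 8):
    for k in range(0, 4):
        assert min_replication_steps(n, k) == n * (k + 1)
```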

After having all truth-assignments generated in membrane m, we check
whether or not there is at least one truth-assignment which satisfies clause $C_m$;
all truth-assignments for which $C_m$ is true exit membrane m. In membrane m - 1
we check the satisfiability of clause $C_{m-1}$, and again we pass one level up all
truth-assignments which satisfy $C_{m-1}$. We continue in this way until reaching
the skin region, where we check the satisfiability of $C_1$. This means that after
the r + 3 steps in membrane m we perform a further m - 1 steps; in total, this
means r + m + 2 steps.

The formula C is satisfiable if and only if at least one string is sent out of
the system in step r + m + 2. Because r is linearly bounded with respect to n,
the problem is solved in linear time (with respect to both n and m).
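
The clause-by-clause filtering can be sketched in ordinary sequential code. The
sketch below does not simulate the P system: it simply enumerates all truth-
assignments (which is what the reliability assumption guarantees to be present)
and lets the survivors pass through membranes m, m - 1, ..., 1. The clause
encoding (integer i for the variable with index i, -i for its negation) and the
function names are assumptions made for illustration.

```python
from itertools import product


def surviving_assignments(clauses, n):
    """Assignments that pass every membrane.

    clauses[j - 1] encodes clause C_j as a set of nonzero integers:
    i stands for variable i and -i for its negation.
    """
    # Membrane m after the a_i -> t_i / a_i -> f_i step: all 2**n strings.
    strings = [dict(zip(range(1, n + 1), bits))
               for bits in product((True, False), repeat=n)]
    # C_m is checked in membrane m, then C_{m-1} one level up, ..., C_1 in the skin.
    for clause in reversed(clauses):
        strings = [s for s in strings
                   if any(s[abs(lit)] == (lit > 0) for lit in clause)]
    return strings


def total_steps(r, m):
    """Step in which strings leave the system: r + 3 in membrane m, then m - 1 more."""
    return (r + 3) + (m - 1)
```

The formula is satisfiable exactly when `surviving_assignments` returns a
non-empty list; note that this sequential sketch takes exponential time, whereas
the P system relies on parallelism and on the reliability assumption.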

Note that the system is of polynomial size with respect to n and m, hence
it can be constructed in polynomial time by a Turing machine starting from C.
If we start from a formula in the 3-normal form (at most three variables in each
clause), then the system will be of linear size in terms of n and m.

Reliability seems to be both a well motivated property, closely related to
the biochemical reality, and a very useful one from a theoretical point of view;
further investigations in this area are worth pursuing.



Abstract. We study the P versus NP problem through membrane systems.
Language accepting P systems are introduced as a framework allowing
us to obtain a characterization of the $P \ne NP$ relation by the
polynomial time unsolvability of an NP-complete problem by means of
a P system.



1 Introduction 

The P versus NP problem [2] is the problem of determining whether every 
language accepted by some non-deterministic algorithm in polynomial time is 
also accepted by some deterministic algorithm in polynomial time. To define 
the above problem precisely we must have a formal definition for the concept 
of an algorithm. The theoretical model to be used as a computing machine in 
this work is the Turing machine, introduced by Alan Turing in 1936 [10], several 
years before the invention of modern computers. 

A deterministic Turing machine has a transition function providing a func- 
tional relation between configurations; so, for every input there exists only one 
computation (finite or infinite), allowing us to define in a natural way when an 
input is accepted (through an accepting computation). 

In a non-deterministic Turing machine, for a given configuration several suc- 
cessor configurations can exist. Therefore, it could happen that for a given input 
different computations exist. In these machines, an input is accepted if there 
exists at least one finite accepting computation associated with it. 

The class P is the class of languages accepted by some deterministic Turing
machine in a time bounded by a polynomial in the length (size) of the input.
From an informal point of view, the languages in the class P are identified with 
the problems having an efficient algorithm that gives an answer in a feasible 
time; the problems in P are also known as tractable problems. 

The class NP is the class of languages accepted by some non-deterministic
Turing machine where for every accepted input there exists at least one accepting
computation taking a number of steps bounded by a polynomial in the length
of the input.

(c) Springer-Verlag Berlin Heidelberg 2004

The P Versus NP Problem Through Cellular Computing with Membranes

Every deterministic Turing machine can be considered as a non-deterministic
one, so we have $P \subseteq NP$. In terms of the previously defined classes, the P versus
NP problem can be expressed as follows: does the relation $NP \subseteq P$ hold?

The P versus NP question is one of the outstanding open problems in theoretical
computer science. The relevance of this question lies not only in the
inherent pleasure of solving a mathematical problem; an answer to it could
also provide information of high practical interest. For instance,
a negative answer to this question would confirm that the majority of current
cryptographic systems are secure from a practical point of view. On the other
hand, a positive answer would not only entail the vulnerability of cryptographic
systems, but is also expected to come together with a general procedure
providing a deterministic algorithm solving any NP-complete
problem in polynomial time.

Moreover, the problems known to be in the class NP but not known to be
in P are varied and of the highest practical interest. An NP-complete problem is a
hardest (in a certain sense) problem in NP; that is, any problem in NP could be
efficiently solved using an efficient algorithm which solves a fixed NP-complete
problem. These problems are suitable candidates for attacking the P versus NP
problem.

In recent years several computing models using powerful and inherent tools
inspired by nature have been developed (for this reason, they are known
as bio-inspired models) and several polynomial-time solutions to problems
from the class NP have been presented, making use of non-determinism or of
an exponential amount of space. This is the reason why a practical implementation
of such models (in biological, electronic, or other media) could provide a
quantitative improvement for the resolution of NP-complete problems.

In this work we focus on one of these models, the cellular computing model 
with membranes, specifically, on one of its variants, the language accepting P 
systems, in order to develop a computational complexity theory allowing us to
attack the P versus NP problem from a point of view other than the classical
one.

The paper is structured as follows. The next section is devoted to the def- 
inition of language accepting P systems. In section 3 a polynomial complexity 
class for the above model is introduced. Sections 4 and 5 provide simulations of
deterministic Turing machines by P systems and of language accepting P systems
by deterministic Turing machines. Finally, in section 6 we establish a
ization of the P versus NP problem through P systems. 



2 Language Accepting P Systems 

Until the end of the 1990s, several natural computing models had been
introduced, simulating the way nature computes at the genetic level (genetic
algorithms and DNA based molecular computing) and at the neural level (neural
networks). In 1998, Gh. Paun [5] suggested a new level of computation: the cellular
level.

Cells can be considered as machines performing certain computing processes;
in the distributed framework of the hierarchical arrangement of internal vesicles,
the communication and alteration of the chemical components of the cell are
carried out. Of course, the processes taking place in the cell are too complex
to attempt to model them completely. The goal is to create an abstract
cell-like computing model allowing us to obtain alternative solutions to problems
which are intractable from a classical point of view.

The first characteristic to point out from the internal structure of the cell is 
the fact that the different units composing the cell are delimited by several types 
of membranes (in a broad sense): from the membrane that separates the cell 
from the environment into which the cell is placed, to those delimiting the inner 
vesicles. Also, with regard to the functionality of these membranes in nature, it
must be emphasized that they do not generate isolated compartments,
but allow the chemical compounds to flow between them, sometimes in
selective forms and even in only one direction. Similar ideas were previously
considered, for instance, in [1] and [3]. 

P systems are described in [4] as follows: a membrane structure consists of 
several membranes arranged in a hierarchical structure inside a main membrane 
(called the skin) and delimiting regions (each region is bounded by a mem- 
brane and the immediately lower membranes, if there are any). Regions contain 
multisets of objects, that is, sets of objects with multiplicities associated with 
the elements. The objects are represented by symbols from a given alphabet. 
They evolve according to given evolution rules, which are also associated with 
the regions. The rules are applied non-deterministically, in a maximally parallel 
manner (in each step, all objects which can evolve must do so). The objects 
can also be moved (communicated) between regions. In this way, we get tran- 
sitions from one configuration of the system to the next one. This process is 
synchronized: a global clock is assumed, marking the time units common to all 
compartments of the system. A sequence (finite or infinite) of transitions be- 
tween configurations constitutes a computation; a computation which reaches 
a configuration where no rule is applicable to the existing objects is a halting 
computation. With each halting computation we associate a result, by taking 
into consideration the objects collected in a specified output membrane or in the 
environment. 

For an exhaustive overview of transition P systems and of their variants and 
properties, see [4]. 

Throughout this paper, we will study the capacity of cellular systems with 
membranes to attack the efficient solvability of presumably intractable decision 
problems. We will focus on a specific variant of transition P systems: language 
accepting P systems. These systems have an input membrane, and work in such
a way that when a properly encoded string is introduced in the input membrane,
a “message” is sent to the environment, encoding whether or not this string
belongs to a specified language.

Definition 1. A membrane structure is a rooted tree, where the nodes are called
membranes, the root is called skin, and the leaves are called elementary
membranes.



Definition 2. Let $\mu = (V(\mu), E(\mu))$ be a membrane structure. The membrane
structure with external environment associated with $\mu$ is the rooted tree such
that: (a) the root of the tree is a new node that we denote by env; (b) the set of
nodes is $V(\mu) \cup \{env\}$; and (c) the set of edges is $E(\mu) \cup \{\{env, skin\}\}$.

The node env is called the environment of the structure $\mu$. So, every membrane
structure has an environment associated with it in a natural way.
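
As a concrete reading of Definition 2, the sketch below represents a membrane
structure by its node and edge sets and adds the environment node. The
representation (Python sets, the string "env" as the new node) is an assumption
chosen for illustration.

```python
def with_environment(nodes, edges, skin):
    """Return the node and edge sets of the structure with external environment."""
    if skin not in nodes:
        raise ValueError("the skin must be a membrane of the structure")
    if "env" in nodes:
        raise ValueError("'env' must be a new node")
    new_nodes = set(nodes) | {"env"}
    new_edges = {frozenset(e) for e in edges} | {frozenset({"env", skin})}
    return new_nodes, new_edges


# Skin 1 containing the elementary membranes 2 and 3.
nodes, edges = with_environment({1, 2, 3}, {(1, 2), (1, 3)}, skin=1)
```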

Definition 3. A language accepting P system (with input membrane and
external output) is a tuple

\[
\Pi = (\Sigma, \Gamma, \Lambda, \#, \mu_\Pi, \mathcal{M}_1, \ldots, \mathcal{M}_p, (R_1, \rho_1), \ldots, (R_p, \rho_p), i_\Pi)
\]

verifying the following properties:

- The input alphabet of $\Pi$ is $\Sigma$.

- The working alphabet of $\Pi$ is $\Gamma$, with $\Sigma \subset \Gamma$ and $\# \in \Gamma - \Sigma$.

- $\mu_\Pi$ is a membrane structure consisting of p membranes, with the membranes
(and hence the regions) injectively labelled with 1, 2, . . . , p.

- $i_\Pi$ is the label of the input membrane.

- The output alphabet of $\Pi$ is $\Lambda = \{Yes, No\}$.

- $\mathcal{M}_1, \ldots, \mathcal{M}_p$ are multisets over $\Gamma - \Sigma$, representing the initial contents of the
regions of 1, 2, . . . , p of $\mu_\Pi$.

- $R_1, \ldots, R_p$ are finite sets of evolution rules over $\Gamma$ associated with the regions
1, 2, . . . , p of $\mu_\Pi$.

- $\rho_i$, $1 \le i \le p$, are partial order relations over $R_i$ specifying a priority relation
among rules of $R_i$.

An evolution rule is a pair (u, v), usually represented $u \to v$, where u is a string
over $\Gamma$ and $v = v'$ or $v = v'\delta$, with $v'$ a string over

\[
\Gamma \times (\{here, out\} \cup \{in_i \mid i = 1, \ldots, p\}).
\]

Consider a rule $u \to v$ from a set $R_i$. To apply this rule in membrane i means
to remove the multiset of objects specified by u from membrane i (the latter
must contain, therefore, sufficient objects so that the rule can be applied), and
to introduce the objects specified by v, in the membranes indicated by the target
commands associated with the objects from v.

Specifically, for each $(a, out) \in v$ an object a will exit the membrane i and
will become an element of the membrane immediately outside it (that is, the
father membrane of membrane i), or will leave the system and go to the
environment if the membrane i is the skin membrane. If v contains a pair (a, here),
then the object a will remain in the same membrane i where the rule is applied
(when specifying rules, pairs (a, here) are simply written a; the indication here
is omitted). For each $(a, in_j) \in v$ an object a is moved into the membrane
with label j, provided that this membrane is immediately inside membrane i
(that is, membrane i is the father of membrane j); if membrane j is not directly
accessible from membrane i (that is, if membrane j is not a child membrane of
membrane i), then the rule cannot be applied. Finally, if $\delta$ appears in v, then
membrane i is dissolved; that is, membrane i is removed from the membrane
structure, and all objects and membranes previously present in it become
elements of the immediately upper membrane (the father membrane), while the
evolution rules and the priority relations of the dissolved membrane are removed.
The skin membrane is never dissolved; that is, no rule of the form $u \to v'\delta$ is
applicable in the skin membrane.

All these operations are done in parallel, for all possible applicable rules
$u \to v$, for all occurrences of multisets u in the membrane associated with the
rules, and for all membranes at the same time.

The rules from the set $R_i$, $1 \le i \le p$, are applied to objects from membrane
i synchronously, in a non-deterministic maximally parallel manner; that is, we
assign objects to rules, non-deterministically choosing the rules and the objects
assigned to each rule, but in such a way that after this assignment no further
rule can be applied to the remaining objects. Therefore, a rule can be applied in
the same step as many times as the number of copies of objects allows.
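
The following sketch illustrates the maximally parallel, non-deterministic
assignment of objects to rules in a single membrane. It is a simplification:
targets, priorities and dissolution are ignored, and rules are given as pairs of
multisets (`collections.Counter`). All names are illustrative.

```python
import random
from collections import Counter


def maximal_parallel_step(contents, rules, rng=random):
    """Apply rules (lhs, rhs) to `contents` until no rule fits the unused objects."""
    if any(sum(lhs.values()) == 0 for lhs, _ in rules):
        raise ValueError("every rule must consume at least one object")
    available = Counter(contents)   # objects not yet assigned to a rule
    produced = Counter()            # results become visible only in the next step
    while True:
        usable = [(lhs, rhs) for lhs, rhs in rules
                  if all(available[a] >= c for a, c in lhs.items())]
        if not usable:
            break                   # maximality: nothing more can be applied
        lhs, rhs = rng.choice(usable)
        available -= lhs
        produced += rhs
    return available + produced
```

For example, with three copies of a and the single cooperative rule aa → b, one
application is possible and one copy of a is left over, so the step yields one a
and one b.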

On the other hand, we interpret the priority relations between the rules in
a strong sense: a rule $u \to v$ in a set $R_i$ can be used only if no rule of a higher
priority exists in $R_i$ and can be applied at the same time as $u \to v$.

A configuration of $\Pi$ is a tuple $(\mu, M_E, M_{i_1}, \ldots, M_{i_q})$, where $\mu$ is a membrane
structure obtained by removing from $\mu_\Pi$ all membranes different from $i_1, \ldots, i_q$
(of course, the skin membrane cannot be removed), $M_E$ is the multiset of objects
contained in the environment of $\mu$, and $M_{i_j}$ is the multiset of objects contained
in the region $i_j$.

For every multiset m over $\Sigma$ (the input alphabet of the P system), the initial
configuration of $\Pi$ with input m is the tuple
$(\mu_\Pi, \emptyset, \mathcal{M}_1, \ldots, \mathcal{M}_{i_\Pi} \cup m, \ldots, \mathcal{M}_p)$.
That is, in any initial configuration of $\Pi$ the environment is empty. We will
denote by $I_\Pi$ the collection of possible inputs for the system $\Pi$.

Given a configuration C of a P system $\Pi$, applying properly the evolution
rules as described above, we obtain, in a non-deterministic way, a new
configuration C'. We denote this by $C \Rightarrow_\Pi C'$, and we say that we have a transition
from C to C'. A halting configuration is a configuration in which no evolution rule
can be applied.

A computation $\mathcal{C}$ of a P system is a sequence of configurations, $\{C^i\}_{i<r}$,
where: $C^0$ is an initial configuration of the system; $C^i \Rightarrow_\Pi C^{i+1}$ for every i
with $i + 1 < r$; and, either $r \in \mathbb{N}^+$ (that is, it is a non-zero natural number) and
$C^{r-1}$ is a halting configuration, or $r = \infty$, in which case it is said that $\mathcal{C}$ is not
halting.

For a computation $\mathcal{C} = \{C^i\}_{i<r}$, we will denote by $M_E^i$ the content of the
environment in the configuration $C^i$. Next we define the output of the P system.

Definition 4. The output of a computation $\mathcal{C} = \{C^i\}_{i<r}$ is:

\[
Output(\mathcal{C}) =
\begin{cases}
Yes, & \text{if } \mathcal{C} \text{ is halting, } Yes \in M_E^{r-1} \text{ and } No \notin M_E^{r-1},\\
No, & \text{if } \mathcal{C} \text{ is halting, } No \in M_E^{r-1} \text{ and } Yes \notin M_E^{r-1},\\
\text{not defined}, & \text{otherwise.}
\end{cases}
\]

If $\mathcal{C}$ satisfies either of the first two conditions, then we say that it is a successful
computation.
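
Definition 4 translates directly into a small function. In this sketch the
computation is summarized by two pieces of data, whether it halts and the content
of the environment in its last configuration; this summary and the function names
are assumptions made for illustration.

```python
from typing import Collection, Optional


def output(halting: bool, final_environment: Collection[str]) -> Optional[str]:
    """Output of a computation, given the environment of its last configuration."""
    has_yes = "Yes" in final_environment
    has_no = "No" in final_environment
    if halting and has_yes and not has_no:
        return "Yes"
    if halting and has_no and not has_yes:
        return "No"
    return None  # not defined


def is_successful(halting: bool, final_environment: Collection[str]) -> bool:
    """A computation is successful when its output is defined."""
    return output(halting, final_environment) is not None
```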



Definition 5. A language accepting P system is said to be valid if every halting
computation is a successful computation and the halting computations, and only
they, send out the symbol # (and only in the last step).

We denote by $\mathcal{LA}$ the class of valid language accepting P systems.

Next we define what it means that such P systems accept or decide a lan- 
guage. 

Definition 6. Let L be a language over an alphabet $\Omega$. We say that the system
$\Pi \in \mathcal{LA}$ accepts the language L if the following properties are verified:

- There exists a total function, $cod : \Omega^* \to I_\Pi$, computable and injective,
encoding strings over $\Omega$ by means of multisets over the input alphabet of $\Pi$.

- For every string $w \in \Omega^*$ it is verified that:

• If $w \in L$, then there exists a computation $\mathcal{C}$ of $\Pi$ with input cod(w) such
that $\mathcal{C}$ is halting and $Output(\mathcal{C}) = Yes$.

• If there exists a computation $\mathcal{C}$ of $\Pi$ with input cod(w) such that $\mathcal{C}$ is
halting and $Output(\mathcal{C}) = Yes$, then $w \in L$.



Definition 7. Let L be a language over an alphabet $\Omega$. We say that the system
$\Pi \in \mathcal{LA}$ decides the language L if the following properties are verified:

- Every computation of $\Pi$ is halting.

- There exists a total function, $cod : \Omega^* \to I_\Pi$, computable and injective,
encoding strings over $\Omega$ by means of multisets over the input alphabet of $\Pi$.

- For every string $w \in \Omega^*$ it is verified that:

• If $w \in L$, then for every computation $\mathcal{C}$ of $\Pi$ with input cod(w) it is
verified that $Output(\mathcal{C}) = Yes$.

• If $w \notin L$, then for every computation $\mathcal{C}$ of $\Pi$ with input cod(w) it is
verified that $Output(\mathcal{C}) = No$.
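
The difference between accepting (Definition 6) and deciding (Definition 7) can
be sketched as two predicates. The sketch assumes a hypothetical function
`outputs_of(w)` returning one entry per computation of the system on cod(w):
"Yes", "No", or `None` when the computation is not halting or its output is not
defined. It checks only a finite sample of words, so it illustrates the
definitions rather than verifying them.

```python
def accepts(outputs_of, language, words):
    """w is in L exactly when some computation halts with output Yes."""
    return all((w in language) == any(o == "Yes" for o in outputs_of(w))
               for w in words)


def decides(outputs_of, language, words):
    """Every computation halts with Yes if w is in L and with No otherwise."""
    return all(all(o == ("Yes" if w in language else "No") for o in outputs_of(w))
               for w in words)


language = {"a", "aa"}
sample = ["", "a", "aa", "b"]

confluent = lambda w: ["Yes", "Yes"] if w in language else ["No"]
# Same language, but one computation on "a" never halts.
partial = lambda w: ["Yes", None] if w == "a" else confluent(w)
```

With these toy systems, `confluent` both accepts and decides the language, while
`partial` accepts it but does not decide it.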

3 A Polynomial Complexity Class in Cellular Systems 

In order to give a formal definition of computational complexity classes in this
model, we first have to specify what we mean by a decision problem.

Definition 8. A decision problem, X, is a pair $(I_X, \theta_X)$ such that $I_X$ is a
language (over a finite alphabet) whose elements are called instances of the problem
and $\theta_X$ is a total Boolean function over $I_X$.

A decision problem X is solvable by a Turing machine TM if $I_X$ is the set
of inputs of TM, for any $w \in I_X$ the Turing machine halts over w, and w is
accepted if and only if $\theta_X(w) = 1$.

To solve a problem by means of P systems, we usually construct a family of 
such devices so that each element decides the instances of equivalent size, in a 
certain sense which will be specified below. 

Definition 9. Let $g : \mathbb{N}^+ \to \mathbb{N}^+$ be a total computable function. We say that a
decision problem X is solvable by a family of valid language accepting P systems,
in a time bounded by g, and we denote this by $X \in MC_{\mathcal{LA}}(g)$, if there exists a
family of P systems, $\Pi = (\Pi(n))_{n \in \mathbb{N}^+}$, with the following properties:

1. For every n it is verified that $\Pi(n) \in \mathcal{LA}$.

2. There exists a Turing machine constructing $\Pi(n)$ from n in polynomial time
(we say that $\Pi$ is polynomially uniform by Turing machines).

3. There exist two functions, $cod : I_X \to \bigcup_{n \in \mathbb{N}^+} I_{\Pi(n)}$ and $s : I_X \to \mathbb{N}^+$,
computable in polynomial time, such that:

- For every $w \in I_X$, $cod(w) \in I_{\Pi(s(w))}$.

- The family $\Pi$ is bounded, with regard to (X, cod, s, g); that is, for each
$w \in I_X$ every computation of the system $\Pi(s(w))$ with input cod(w) is
halting and, moreover, it performs at most g(|w|) steps.

- The family $\Pi$ is sound, with regard to (X, cod, s); that is, for each $w \in I_X$
if there exists an accepting computation of the system $\Pi(s(w))$ with input
cod(w), then $\theta_X(w) = 1$.

- The family $\Pi$ is complete, with regard to (X, cod, s); that is, for each
$w \in I_X$ if $\theta_X(w) = 1$, then every computation of the system $\Pi(s(w))$
with input cod(w) is an accepting computation.

Note that we impose a certain kind of confluence of the systems, in the sense 
that every computation with the same input must return the same output. 

As usual, the polynomial complexity class is obtained using as bounds the 
polynomial functions. 

Definition 10. The class of decision problems solvable in polynomial time by a
family of cellular computing systems belonging to the class $\mathcal{LA}$ is

\[
PMC_{\mathcal{LA}} = \bigcup_{g \text{ poly.}} MC_{\mathcal{LA}}(g).
\]



This complexity class is closed under polynomial-time reducibility. 

Proposition 1. Let X and Y be two decision problems such that X is
polynomial-time reducible to Y. If $Y \in PMC_{\mathcal{LA}}$, then $X \in PMC_{\mathcal{LA}}$.



4 Simulating Deterministic Turing Machines by P Systems

In this section we consider deterministic Turing machines as language decision 
devices. That is, the machines halt over any string on the input alphabet, with 
the halting state equal to the accepting state, in the case that the string belongs 
to the decided language, and with the halting state equal to the rejecting state 
in the case that the string does not belong to the language. 

It is possible to associate with a Turing machine a decision problem, and
this will permit us to define what it means for such a machine to be simulated
by a family of P systems.

Definition 11. Let TM be a Turing machine with input alphabet $\Sigma_{TM}$. The
decision problem associated with TM is the problem $X_{TM} = (I, \theta)$, where
$I = \Sigma_{TM}^*$, and for every $w \in \Sigma_{TM}^*$, $\theta(w) = 1$ if and only if TM accepts w.

Obviously, the decision problem $X_{TM}$ is solvable by the Turing machine TM.

Definition 12. We say that a Turing machine TM is simulated in polynomial
time by a family of systems of the class $\mathcal{LA}$ if $X_{TM} \in PMC_{\mathcal{LA}}$.

Next we state that every deterministic Turing machine can be simulated in
polynomial time by a family of systems of the class $\mathcal{LA}$.

Proposition 2. Let TM be a deterministic Turing machine working in
polynomial time. Then $X_{TM} \in PMC_{\mathcal{LA}}$.

See chapter 9 of [8], which follows ideas from [9], for details of the proof. 

5 Simulating Language Accepting P Systems by Deterministic Turing Machines

In this section we are going to prove that if a decision problem can be solved in 
polynomial time by a family of language accepting P systems, then it can also 
be solved in polynomial time by a deterministic Turing machine. 

For the design of the Turing machine we were inspired by the work of C. 
Zandron, C. Ferretti and G. Mauri [11], with the difference that the mentioned 
paper deals with P systems with active membranes. 

Proposition 3. For every decision problem solvable in polynomial time by a 
family of valid language accepting P systems, there exists a Turing machine 
solving the problem in polynomial time. 

Proof. Let X be a decision problem such that $X \in PMC_{\mathcal{LA}}$. Then, there exists
a family of valid language accepting P systems $\Pi = (\Pi(n))_{n \in \mathbb{N}^+}$ such that:

1. The family $\Pi$ is polynomially uniform by Turing machines.

2. There exist two functions $cod : I_X \to \bigcup_{n \in \mathbb{N}^+} I_{\Pi(n)}$ and $s : I_X \to \mathbb{N}^+$,
computable in polynomial time, such that:

- For every $w \in I_X$, $cod(w) \in I_{\Pi(s(w))}$.

- The family $\Pi$ is polynomially bounded, with regard to (X, cod, s).

- The family $\Pi$ is sound and complete, with regard to (X, cod, s).

Given $n \in \mathbb{N}^+$, let $A_n$ be the number of symbols in the input alphabet of
$\Pi(n)$, $B_n$ the number of symbols in the working alphabet, $C_n$ the number of
symbols in the output alphabet, $D_n$ the number of membranes, $E_n$ the maximum
size of the multisets initially associated with them, $F_n$ the total number of rules
of the system, and $G_n$ the maximum length of them. Since the family $\Pi$ is
polynomially uniform by Turing machines, these numbers are polynomial with
respect to n.

Let m be an input multiset of the system $\Pi(n)$. Given a computation $\mathcal{C}$ of
$\Pi(n)$ with input m, we denote by $H_n(m)$ the maximum number of digits, in
base 2, of the multiplicities of the objects contained in the multisets associated
with the membranes of the system and with the environment, in any step of $\mathcal{C}$.
Naturally, this number depends on $\mathcal{C}$, but what we are interested in, and we will
prove at the end of the proof, is that any computation of the system $\Pi(s(w))$
with input cod(w) verifies that $H_{s(w)}(cod(w))$ is polynomial in the size of the
string w.

Next, we associate with the system $\Pi(n)$ a deterministic Turing machine,
TM(n), with multiple tapes, such that, given an input multiset m of $\Pi(n)$, the
machine reproduces a specific computation of $\Pi(n)$ over m.

The input alphabet of the machine TM(n) coincides with that of the system
$\Pi(n)$. On the other hand, the working alphabet contains, besides the symbols
of the input alphabet of $\Pi(n)$, the following symbols: a symbol for each label
assigned to the membranes of $\Pi(n)$; the symbols 0 and 1, which allow operating
with numbers represented in base 2; three symbols indicating whether a membrane
has not been dissolved, has to be dissolved or has been dissolved; and three symbols
indicating whether a rule is awaiting, is applicable or is not applicable.
Next, we specify the tapes of this machine.

— We have one input tape, that keeps a string representing the input multiset 
received. 

— For each membrane of the system we have: 

• One structure tape, that keeps in the second cell the label of the father 
membrane, and in the third cell one of the three symbols that indicate if 
the membrane has not been dissolved, if the membrane has to dissolve, 
or if the membrane has been dissolved. 

• For each object of the working alphabet of the system: 

* One main tape, that keeps the multiplicity of the object, in base 2, 
in the multiset contained in the membrane. 

* One auxiliary tape, that keeps temporary results, also in base 2, of 
applying the rules associated with the membrane. 
• One rules tape, in which each cell starting with the second one
corresponds to a rule associated with the membrane (we suppose that the set
of those rules is ordered), and keeps one of the three symbols that
indicate whether the rule is awaiting, is applicable, or is not applicable.

— For each object of the output alphabet we have: 

• One environment tape, that keeps the multiplicity of the object, in base 
2, in the multiset associated with the environment. 

Next we describe the steps performed by the Turing machine in order to
simulate the P system. Note that a breadth-first traversal (with the skin as
source) of the initial membrane structure of the system $\Pi(n)$ yields a
natural order on the membranes of $\Pi(n)$. The algorithms specified below
always traverse all the membranes of the original membrane structure and,
moreover, do so in the order induced by the breadth-first traversal of that
structure.
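
The membrane order used by all the algorithms below can be illustrated with a short sketch. It assumes the membrane structure is given as a dictionary from a label to the list of its children; this representation, the function name `breadth_first_order` and the sample structure are illustrative assumptions, not part of the construction.

```python
from collections import deque

def breadth_first_order(children, skin):
    """Return the membrane labels in breadth-first order, starting at the skin."""
    order = []
    queue = deque([skin])
    while queue:
        mb = queue.popleft()
        order.append(mb)
        # Children are visited in the order in which they are listed.
        queue.extend(children.get(mb, []))
    return order

# Hypothetical structure: skin 1 contains 2 and 3; 2 contains 4; 3 contains 5.
structure = {1: [2, 3], 2: [4], 3: [5]}
print(breadth_first_order(structure, 1))
```

Because a father always precedes its children in this order, a pass over the membranes can rely on the father having been processed already.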

I. Initialization of the system. In the first phase of the simulation process fol- 
lowed by the Turing machine the symbols needed to reflect the initial configura- 
tion of the computation with input m that is going to be simulated are included 
in the corresponding tapes. 

for all membrane mb of the system do
    if mb is not the skin membrane then
        - Write in the second cell of the structure tape of mb the label
          corresponding to the father of mb
    end if
    - Mark mb as non-dissolved membrane in the third cell of the structure
      tape of mb
    for all symbol ob of the working alphabet do
        - Write in the main tape of mb for ob the multiplicity, in base 2,
          of ob in the multiset initially associated with mb
    end for
end for
for all symbol ob of the input alphabet do
    - Read the multiplicity, in base 2, of ob in the input tape
    - Add that multiplicity to the main tape of the input membrane for ob
end for

II. Determine the applicable rules. To simulate a step of the cellular computing 
system, what the machine has to do first is to determine the set of rules that are 
applicable (each of them independently) to the configuration considered in the 
membranes they are associated with. 

for all membrane mb of the system do
    if mb has not been dissolved then
        for all rule r associated with mb do
            - Mark r as awaiting rule
        end for
        for all rule r associated with mb do
            if - r is awaiting and
               - mb contains the antecedent of r and
               - r only sends objects to child membranes of mb that have
                 not been dissolved and
               - r does not try to dissolve the skin membrane
            then
                - Mark r as applicable rule
                for all rule r' associated with mb of lower priority than r do
                    - Mark r' as non-applicable rule
                end for
            else
                - Mark r as non-applicable rule
            end if
        end for
    end if
end for

III. Apply the rules. Once the applicable rules are determined, they are applied 
in a maximal manner to the membranes they are associated with. The fact 
that the rules are considered in a certain order (using local maximality for each 
rule, according to that order) determines a specific applicable multiset of rules, 
thus fixing the computation of the system that the Turing machine simulates. 
However, from Definition 9 of complexity class it will follow that the chosen 
computation is not relevant for the proof, due to the confluence of the system. 

for all membrane mb of the system do
    if mb has not been dissolved then
        for all rule r associated with mb that is applicable do
            for all object ob in the antecedent of r do
                - Compute the integer quotient that results from dividing the
                  multiplicity of ob in the main tape of mb by the
                  multiplicity of ob in the antecedent of r
            end for
            - Compute the minimum of the values obtained in the previous loop
              (that minimum is the maximum number of times that the rule r
              can be applied to membrane mb). Let us call it index of the
              rule r.
            for all object ob in the antecedent of r do
                - Multiply the multiplicity of ob in the antecedent of r by
                  the index of r
                - Erase the result obtained from the main tape of mb for the
                  object ob
            end for
            for all object ob in the consequent of r do
                - Multiply the multiplicity of ob in the consequent of r by
                  the index of r
                - Add the result obtained to the auxiliary tape for ob in the
                  corresponding membrane
            end for
            if r dissolves mb then
                - Mark mb as membrane to dissolve in the third cell of the
                  structure tape of mb
            end if
        end for
    end if
end for
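
The core of phase III, computing the index of a rule and applying the rule that many times, can be sketched as follows. This is a simplified illustration under stated assumptions: multisets are Python dictionaries from objects to multiplicities (Python integers stand in for the base-2 tapes), all products stay in the same membrane, and the name `apply_rule_maximally` is hypothetical.

```python
def apply_rule_maximally(main, aux, antecedent, consequent):
    """Apply one rule as often as possible and return its index.

    main and aux play the role of the main and auxiliary tapes of a membrane;
    antecedent and consequent map objects to their multiplicities in the rule.
    """
    # Index of the rule: minimum integer quotient over the antecedent objects.
    index = min(main.get(ob, 0) // k for ob, k in antecedent.items())
    for ob, k in antecedent.items():
        # Remove the consumed copies from the main tape.
        main[ob] = main.get(ob, 0) - k * index
    for ob, k in consequent.items():
        # Products wait on the auxiliary tape until phase IV.
        aux[ob] = aux.get(ob, 0) + k * index
    return index

main_tape = {"a": 7, "b": 5}
aux_tape = {}
# Hypothetical rule a^2 b -> c^3.
print(apply_rule_maximally(main_tape, aux_tape, {"a": 2, "b": 1}, {"c": 3}))
print(main_tape, aux_tape)
```

Keeping the products on a separate tape ensures that objects created in a step cannot be consumed by another rule within the same step.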

IV. Update the multisets. After applying the rules, the auxiliary tapes keep the 
results obtained, and then these results have to be moved to the corresponding 
main tapes. 

for all membrane mb of the system do
    if mb has not been dissolved then
        - Copy the content of the auxiliary tapes of mb into the
          corresponding main tapes
    end if
end for

V. Dissolve the membranes. To finish the simulation of one step of the computa- 
tion of the P system it is necessary to dissolve the membranes according to the 
rules that have been applied in the previous phase and to rearrange accordingly 
the structure of membranes. 

for all membrane mb of the system do
    if - mb has not been dissolved and
       - the father of mb is marked as membrane to dissolve
    then
        - Make the father of mb equal to the father of the father of mb
    end if
end for
for all membrane mb of the system do
    if mb is marked as membrane to dissolve then
        - Copy the contents of the main tapes of mb into the main tapes of
          the (possibly new) father of mb
        - Mark mb as dissolved membrane in the third cell of the structure
          tape of mb
    end if
end for
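
The two passes of phase V can be sketched as below. The dictionaries `father`, `state` and `contents`, the three state constants and the function name `dissolve` are illustrative assumptions standing in for the structure tapes and main tapes; copying contents is interpreted as adding multiplicities.

```python
ALIVE, TO_DISSOLVE, DISSOLVED = "alive", "to dissolve", "dissolved"

def dissolve(order, father, state, contents):
    """Re-attach children of dissolving membranes, then empty those membranes.

    order must list the membranes in breadth-first order, so that a father is
    always re-attached before its children are examined.
    """
    for mb in order:
        parent = father[mb]
        if state[mb] != DISSOLVED and parent is not None \
                and state[parent] == TO_DISSOLVE:
            father[mb] = father[parent]
    for mb in order:
        if state[mb] == TO_DISSOLVE:
            target = contents[father[mb]]
            for ob, k in contents[mb].items():
                target[ob] = target.get(ob, 0) + k
            contents[mb] = {}
            state[mb] = DISSOLVED

# Hypothetical chain 1 > 2 > 3 > 4 in which membranes 2 and 3 dissolve.
father = {1: None, 2: 1, 3: 2, 4: 3}
state = {1: ALIVE, 2: TO_DISSOLVE, 3: TO_DISSOLVE, 4: ALIVE}
contents = {1: {"a": 1}, 2: {"a": 2}, 3: {"b": 1}, 4: {"c": 1}}
dissolve([1, 2, 3, 4], father, state, contents)
print(father[4], contents[1])
```

The breadth-first order is what makes a single pass sufficient when a membrane and its father dissolve in the same step.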

VI. Check if the simulation has ended. Finally, after finishing the simulation of
one transition step of the computation of $\Pi(n)$, the Turing machine has to
check if a halting configuration has been reached and, in that case, if the
computation is an accepting or a rejecting one.
if the environment tape contains the symbol # then
    if the environment tape contains the symbol Yes then
        - Halt and accept the multiset m
    else
        - Halt and reject the multiset m
    end if
else
    - Simulate again a step of the computation of the P system
end if

It is easy to check that the family $(TM(n))_{n \in \mathbb{N}^+}$ can be
constructed in a uniform way and in polynomial time from $n \in \mathbb{N}^+$.

Let us finally consider the deterministic Turing machine $TM_\Pi$ that works as
follows:

Input: $w \in I_X$

- Compute $s(w)$

- Construct $TM(s(w))$

- Compute $cod(w)$

- Simulate the functioning of $TM(s(w))$ with input $cod(w)$

Then, the following assertions are verified: 

1. The machine $TM_\Pi$ works in polynomial time in $|w|$.

Since the functions $cod$ and $s$ are polynomial in $|w|$, the numbers
$A_{s(w)}, B_{s(w)}, \ldots, G_{s(w)}$ are polynomial in $|w|$.

On the other hand, the family $\Pi$ is polynomially bounded with regard
to $(X, cod, s)$. Therefore, every computation of the system $\Pi(s(w))$ with
input $cod(w)$ performs a number of steps polynomial in $|w|$. Consequently,
the number of steps, $p_w$, performed by the computation simulated by the
machine $TM(s(w))$ over $cod(w)$ is polynomial in $|w|$.

Hence, the maximal multiplicity of the objects contained in the multisets
associated with the membranes is in the order of
$O(E_{s(w)} \cdot G_{s(w)}^{p_w})$. This implies that $H_{s(w)}(cod(w))$ is in
the order of $O(p_w \cdot \log_2(E_{s(w)} \cdot G_{s(w)}))$, that is,
polynomial in $|w|$.

It follows that the total time spent by $TM_\Pi$ when receiving $w$ as input is
polynomial in $|w|$.

2. Let us suppose that $TM_\Pi$ accepts the string $w$. Then the computation of
$\Pi(s(w))$ with input $cod(w)$ simulated by $TM(s(w))$ is an accepting
computation. Therefore $\theta_X(w) = 1$.

3. Let us suppose that $\theta_X(w) = 1$. Then every computation of $\Pi(s(w))$
with input $cod(w)$ is an accepting computation. Therefore, so is the
computation simulated by $TM(s(w))$. Hence $TM_\Pi$ accepts the string $w$.



Consequently, we have proved that $TM_\Pi$ solves $X$ in polynomial time.
6 Characterizing the $P \neq NP$ Relation Through P Systems

Next, we establish characterizations of the $P \neq NP$ relation by means of the
polynomial time unsolvability of NP-complete problems by families of language
accepting P systems.

Theorem 1. The following propositions are equivalent: 

1. $P \neq NP$.

2. $\exists X\,(X \text{ is an NP-complete decision problem} \wedge X \notin PMC_{\mathcal{LA}})$.

3. $\forall X\,(X \text{ is an NP-complete decision problem} \rightarrow X \notin PMC_{\mathcal{LA}})$.

Proof. To prove the implication $1 \Rightarrow 3$, let us suppose that there
exists an NP-complete problem $X$ such that $X \in PMC_{\mathcal{LA}}$. Then,
from Proposition 3, there exists a deterministic Turing machine solving the
problem $X$ in polynomial time. Hence $X \in P$, and therefore $P = NP$, which
is a contradiction.

The implication $3 \Rightarrow 2$ is trivial, because the class of NP-complete
problems is nonempty.

Finally, to prove the implication $2 \Rightarrow 1$, let $X$ be an NP-complete
problem such that $X \notin PMC_{\mathcal{LA}}$. Let us suppose that $P = NP$.
Then $X \in P$. Therefore there exists a deterministic Turing machine $TM$ that
solves the problem $X$ in polynomial time.

By Proposition 2, the problem $X_{TM}$ is in $PMC_{\mathcal{LA}}$. Then there
exists a family $\Pi_{TM} = (\Pi_{TM}(k))_{k \in \mathbb{N}^+}$ of valid language
accepting P systems simulating $TM$ in polynomial time (with associated
functions $cod_{TM}$ and $s_{TM}$).

We consider the function
$cod_X : I_X \to \bigcup_{k \in \mathbb{N}^+} I_{\Pi_{TM}(k)}$, given by
$cod_X(w) = cod_{TM}(w)$, and the function $s_X : I_X \to \mathbb{N}^+$, given
by $s_X(w) = |w|$. Then:

— The family $\Pi_{TM}$ is polynomially uniform by Turing machine, and
polynomially bounded, with regard to $(X, cod_X, s_X)$.

— The family $\Pi_{TM}$ is sound with regard to $(X, cod_X, s_X)$. Indeed, let
$w \in I_X$ be such that there exists a computation of the system
$\Pi_{TM}(s_X(w)) = \Pi_{TM}(s_{TM}(w))$ with input
$cod_X(w) = cod_{TM}(w)$ that is an accepting computation. Then
$\theta_{TM}(w) = 1$. Therefore $\theta_X(w) = 1$.

— The family $\Pi_{TM}$ is complete with regard to $(X, cod_X, s_X)$. Indeed,
let $w \in I_X$ be such that $\theta_X(w) = 1$. Then $TM$ accepts the string
$w$. Therefore $\theta_{TM}(w) = 1$. Hence, every computation of the system
$\Pi_{TM}(s_{TM}(w)) = \Pi_{TM}(s_X(w))$ with input
$cod_{TM}(w) = cod_X(w)$ is an accepting computation.

Consequently, $X \in PMC_{\mathcal{LA}}$, and this leads to a contradiction.
Abstract. We propose a theoretical model for performing gate operations
- OR, AND, and NOT - using peptide-antibody interactions. The
idea is extended to further gates, such as XOR, NAND, and NOR.



1 Introduction 

With unconventional models of computing taking center stage, much work is
under way to define new bio-computing models that perform computations
efficiently.

Peptide-antibody interactions are very natural and systematic and they can
be used as a basis for a computational model. This way of computing using
antibodies which specifically recognize peptide sequences was introduced by H.
Hug et al. in [3], where the well-known NP-complete problem SAT is solved.
In [2], the computational completeness of peptide computing was proven and
it was shown how to solve two other well-known NP-complete problems, namely
the Hamiltonian path problem and the exact cover by 3-sets problem (a variation
of the set cover problem), using the interactions between peptides and
antibodies.

A peptide is a sequence of amino acids attached by covalent bonds called
peptide bonds. A peptide contains recognition sites for the antibodies, called
epitopes. A peptide can contain more than one epitope for the same or
different antibodies. Each antibody that attaches to a specific epitope has a
binding power associated with it, called affinity. If several antibodies
compete for sites that overlap in the given peptide, then the antibody with
the greater affinity has the higher priority.

In the model proposed in [2,3] the peptides represent the sample space of a 
given problem and antibodies are used to select certain subsets of this sample 
space, which will eventually give the solution for the given problem. Similar 
to DNA-computing, parallel interactions between the peptide sequences and the 
antibodies should make it possible to solve NP-complete problems in polynomial 
time. 
The specificity of an epitope recognition by an antibody need not be absolute.
Sometimes the antibody may recognize a different site which is very similar to 
its own binding site. This issue is called cross-reactivity. This paper does not 
address the issue of cross-reactivity between antibodies and peptides. 

Using DNA strands, H. Hug et al. [4] propose a model to solve simple binary
addition in parallel. In [1] we propose a model to do simple arithmetic
operations (addition and subtraction) using peptides and antibodies.

The organization of the paper is as follows. In the next section we define our 
proposed model of performing gate operations OR, AND and NOT. We explain 
with some examples how the operations are carried out. In section 4 our method 
is extended to do XOR, NAND and NOR gates. Our paper concludes with 
some discussion on the bio-chemical implementation of our proposed model. 



2 Proposed Model 

The proposed model uses a peptide sequence and a set of antibodies. The peptide 
sequence consists of five epitopes; if P denotes the peptide sequence, then P can 
be represented as follows: 

$P = y\,x_1\,x\,x_2\,z,$

where $y$, $x_1$, $x$, $x_2$ and $z$ are epitopes. We take six antibodies
denoted by $A_1$, $A_2$, $B_1$, $B_2$, $C_{init}$ and $C_f$. Basically,

1. the antibodies $A_1$, $A_2$, $B_1$ and $B_2$ denote the inputs,

2. $C_{init}$ denotes the initial value of the result of the operation,

3. $C_{init}$ and $C_f$ denote the output of the operation,

4. the epitopes $y$ and $z$ are the binding places for the antibodies denoting
the input,

5. the epitope $x_1 x x_2$ is the place where the antibody representing the
initial output binds,

6. the epitopes $x_1 x$, $x x_2$ and $x$ are the binding places for the
antibodies denoting the output.

The peptide sequence $P$ is given in Fig. 1. The peptide sequence with possible
antibodies binding on it is represented in Fig. 2. Note that only one antibody
at a time can occupy overlapping epitopes. The figure is meant only to show
the binding sites of the antibodies pictorially.



$y \quad x_1 \quad x \quad x_2 \quad z$



Fig. 1. Peptide sequence 
Fig. 2. Peptide sequence with possible antibodies



3 Implementation of OR, AND, and NOT Gates 

First we discuss the implementation of the OR gate. The implementation of the
AND gate is very similar.

For the OR gate we follow the representation given in the sequel: 

1. Input bits 0 and 1 are represented by the antibodies $A_i$ and $B_i$
respectively, where $1 \le i \le 2$.

2. The antibody $C_{init}$ denotes the bit 0.

3. The antibody $C_f$ (labeled antibody) denotes the bit 1.

4. $epitope(A_1) = \{y\}$, $epitope(A_2) = \{z\}$,

5. $epitope(B_1) = \{y x_1\}$, $epitope(B_2) = \{x_2 z\}$,

6. $epitope(C_{init}) = \{x_1 x x_2\}$, $epitope(C_f) = \{x\}$.

7. $aff(B_i) > aff(C_{init}) > aff(C_f)$, $1 \le i \le 2$.

The simple idea used in the implementation of the OR gate follows from the fact
that the output is 1 even if only one of the inputs is 1. So if we start with
an initial output of 0, which is done by taking the peptide sequence with the
antibody $C_{init}$ bound to it, then the output should be toggled even if only
one 1 comes as an input. To achieve this, we have made the epitopes for the
antibody $C_{init}$ and the antibody $B_i$, $1 \le i \le 2$, overlap. This
facilitates the toggle of the output bit to 1. The algorithm is as follows:

1. Take the peptide sequence P in an aqueous solution. 

2. Add the antibody $C_{init}$.

3. Add antibodies corresponding to the input bits. For example, if the first bit
is 1 and the second bit is 0, then add antibodies $B_1$ and $A_2$.

4. Add antibody $C_f$.

In the above algorithm, the initial antibody $C_{init}$ is removed only if 1
occurs as an input bit (since $aff(B_i) > aff(C_{init})$), which facilitates
the binding of the antibody $C_f$. If both bits are 0, then the antibody
$C_{init}$ is not removed. To read the output, the antibody $C_f$ can be given
a fluorescent label, so that if fluorescence is detected at the end of the
algorithm the output is 1, and otherwise it is 0.
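
The OR gate protocol can be checked with a small simulation. This is a sketch under stated assumptions: epitopes are modelled as sets of the five sites, an antibody binds only if every antibody on an overlapping site has strictly lower affinity (which it then displaces), and the numeric affinities are arbitrary values chosen to respect the order $B_i > C_{init} > C_f$. The affinity given to the antibodies $A_i$ is an assumption, since the text does not fix it.

```python
# Assumed numeric affinities; the text fixes only the order B > Cinit > Cf.
STRONG, MEDIUM, WEAK = 3, 2, 1

# Epitopes are modelled as sets of the sites y, x1, x, x2, z.
OR_GATE = {
    "A1": ({"y"}, WEAK),                    # input bit 0 (affinity assumed)
    "A2": ({"z"}, WEAK),
    "B1": ({"y", "x1"}, STRONG),            # input bit 1
    "B2": ({"x2", "z"}, STRONG),
    "Cinit": ({"x1", "x", "x2"}, MEDIUM),   # initial output, bit 0
    "Cf": ({"x"}, WEAK),                    # labeled output, bit 1
}

def add_antibody(bound, name, epitope, affinity):
    """Bind an antibody unless an equally strong or stronger one blocks a site."""
    rivals = [n for n, (sites, _) in bound.items() if sites & epitope]
    if all(bound[n][1] < affinity for n in rivals):
        for n in rivals:
            del bound[n]                    # weaker antibodies are displaced
        bound[name] = (epitope, affinity)

def or_gate(bit1, bit2):
    """Run the four steps of the protocol and read the label of Cf."""
    bound = {}
    order = ["Cinit", "B1" if bit1 else "A1", "B2" if bit2 else "A2", "Cf"]
    for name in order:
        add_antibody(bound, name, *OR_GATE[name])
    return 1 if "Cf" in bound else 0

for b1 in (0, 1):
    for b2 in (0, 1):
        print(b1, b2, "->", or_gate(b1, b2))
```

Swapping the epitopes of the $A_i$ and $B_i$ and the bits denoted by $C_{init}$ and $C_f$ in the table gives the corresponding model of the AND gate.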

The working of the OR gate model is depicted in Fig. 3. 





Fig. 3. OR gate (example shown for INPUT = 0,1)
AND Gate

In the same manner as above we can implement the AND gate. The details are 
the following. 

1. The antibody $C_{init}$ denotes the bit 1.

2. The antibody $C_f$ (labeled antibody) denotes the bit 0.

3. $epitope(B_1) = \{y\}$, $epitope(B_2) = \{z\}$,

4. $epitope(A_1) = \{y x_1\}$, $epitope(A_2) = \{x_2 z\}$,

5. $epitope(C_{init}) = \{x_1 x x_2\}$, $epitope(C_f) = \{x\}$.

6. $aff(A_i) > aff(C_{init}) > aff(C_f)$, $1 \le i \le 2$.

The simple idea used in the implementation of the AND gate follows from 
the fact that the output 0 occurs even if only one of the inputs is 0. The algorithm 
for the AND gate is the same as for the OR gate. The working of the AND 
gate should be easy to understand as it is very similar to the working of the OR 
gate. 

NOT Gate 

Since the NOT gate requires only one input, we take the peptide sequence as

$P = x\,x_2\,z$

and take only the antibodies $A_2$ and $B_2$. The model is as follows:

1. The antibody $C_{init}$ denotes the bit 0.

2. The antibody $C_f$ (labeled antibody) denotes the bit 1.

3. $epitope(B_2) = \{z\}$,

4. $epitope(A_2) = \{x_2 z\}$,

5. $epitope(C_{init}) = \{x x_2\}$, $epitope(C_f) = \{x\}$.

6. $aff(A_2) > aff(C_{init}) > aff(C_f)$.

The algorithm for NOT gate is as follows: 

1. Take the peptide sequence P in an aqueous solution. 

2. Add the antibody $C_{init}$.

3. Add the antibody corresponding to the input bit.

4. Add antibody $C_f$.

It can be noted that the initial bit denoting 0 is toggled only if the input bit 
is 0. 



4 Extension to XOR, NOR, and NAND Gates 

The same idea used in the implementation of the OR and AND gates is extended 
to XOR, NOR and NAND. First we take the XOR gate; the other two gates 
follow from OR and AND gates very easily. 
The model for the XOR gate requires a small change to the model defined for
the other gates in the previous section. The peptide sequence is the same,
but this gate requires one more antibody, and the binding sites for the output
antibodies are different. It should be easy to follow from the picture in Fig. 4.




Fig. 4. Peptide sequence with possible antibodies 



The construction for the XOR gate is as follows: 

1. Input bits 0 and 1 are represented by the antibodies $A_i$ and $B_i$
respectively, where $1 \le i \le 2$.

2. The antibodies $C_{init}$ and $C_0$ denote the bit 0.

3. The antibody $C_f$ (labeled antibody) denotes the bit 1.

4. $epitope(A_1) = \{y\}$, $epitope(A_2) = \{z\}$,

5. $epitope(B_1) = \{y x_1\}$, $epitope(B_2) = \{x_2 z\}$,

6. $epitope(C_{init}) = \{x_1 x x_2\}$, $epitope(C_f) = \{x_1 x, x x_2\}$ and
$epitope(C_0) = \{x\}$.

7. $aff(B_i) > aff(C_{init}) > aff(C_f) > aff(C_0)$, $1 \le i \le 2$.

The simple idea used in the implementation of the XOR gate is that whenever
the inputs are the same, either the epitope $x_1 x x_2$ is bound by the
antibody $C_{init}$ or the epitope $x$ is free for the antibody $C_0$ to bind
to it. This makes sure that the output is 0 whenever the inputs are the same.
When the inputs are different, either the epitope $x_1 x$ or the epitope
$x x_2$ is free, so that the antibody $C_f$ can bind to it. This guarantees
that the output is 1 when the inputs are different. The algorithm is given
below:

1. Take the peptide sequence P in an aqueous solution. 

2. Add the antibody $C_{init}$.

3. Add antibodies corresponding to the input bits.

4. Add antibody $C_f$.

5. Add antibody $C_0$.
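
The five steps of the XOR protocol can be traced with the following sketch. It assumes the same simple binding model as the gate definitions suggest: an antibody binds only if every antibody on an overlapping site has strictly lower affinity, displacing those antibodies. The numeric affinities (4, 3, 2, 1) and the affinity of the antibodies $A_i$ are assumptions chosen only to respect the stated order, and `try_bind` and `xor_gate` are hypothetical names.

```python
def try_bind(bound, name, epitope, affinity):
    """Bind if every antibody on an overlapping site is strictly weaker."""
    rivals = [n for n, (sites, _) in bound.items() if sites & epitope]
    if any(bound[n][1] >= affinity for n in rivals):
        return False
    for n in rivals:
        del bound[n]                      # displaced by the stronger antibody
    bound[name] = (epitope, affinity)
    return True

def xor_gate(bit1, bit2):
    bound = {}
    try_bind(bound, "Cinit", {"x1", "x", "x2"}, 3)
    if bit1:
        try_bind(bound, "B1", {"y", "x1"}, 4)
    else:
        try_bind(bound, "A1", {"y"}, 1)   # affinity of A_i is an assumption
    if bit2:
        try_bind(bound, "B2", {"x2", "z"}, 4)
    else:
        try_bind(bound, "A2", {"z"}, 1)
    # Cf may use either of its two epitopes, x1 x or x x2.
    for site in ({"x1", "x"}, {"x", "x2"}):
        if try_bind(bound, "Cf", site, 2):
            break
    try_bind(bound, "C0", {"x"}, 1)
    return 1 if "Cf" in bound else 0

for b1 in (0, 1):
    for b2 in (0, 1):
        print(b1, b2, "->", xor_gate(b1, b2))
```

In this model $C_0$ binds only for the input 1,1, where both flanking sites are taken by $B_1$ and $B_2$ and only $x$ remains free.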
NOR Gate

If we perform the following changes in the OR gate construction 

1. $C_{init}$ denotes the bit 1 and

2. $C_f$ denotes the bit 0,

then it should be easy to note that the switching operation represents the NOR 
function. The idea is that even if one 1 comes as an input, the output is 0. It 
should be easy to check that the above construction works exactly as the NOR 
gate. 

NAND Gate 

In the AND gate construction we perform the following changes: 

1. The antibody $C_{init}$ denotes the bit 0.

2. The antibody $C_f$ (labeled antibody) denotes the bit 1.

The idea is that even if one 0 comes as an input, the output is 1. It should be
easy to check that the above construction works exactly as the NAND gate.

5 Remarks 

In this work we have proposed a model to do simple switching operations.
Basically there are two ways of getting the output. One way is just detection,
and the other is to decipher the output biochemically. For detection, we can
give a label to either of the output antibodies, so that at the end of the
process the answer is read off from whether or not fluorescence is detected.

There are many biochemical methods to decipher the interaction of peptides
with antibodies [5]. To extract the numbers from the peptide-antibody
interacting system, Nuclear Magnetic Resonance Spectroscopy (NMR) is one of the
two powerful tools to map the binding regions of peptides and/or proteins, at
atomic level, during their interactions; the other tool is X-ray crystallography.
Depending upon the binding affinity of the peptide with antibody, there are sev- 
eral NMR techniques which can be used for this purpose [7]. If the antibody is 
spin labeled, then mapping of the peptide binding domain in antibody can also 
be accomplished by Surface Activity relationship (SAR) technique [6]. 

It may be interesting to see whether this model can be implemented in a lab- 
oratory. It may also be worthwhile to determine how this model can be extended 
to switching operations on strings of bits with several strings. 
Abstract. Tom Head [1] has given a simple and elegant method of
aqueous computing to solve 3SAT. The procedure makes use of
Divide-Delete-Drop operations performed on plasmids. In [4], a different set
of operations, Cut-Expand-Ligate, is used to solve several NP-complete
problems. In this paper, we combine the features of the two procedures
and define Cut-Delete-Expand-Ligate, which is powerful enough to solve
#3SAT, a counting version of 3SAT known to be in IP. The solution
obtained can be used to break the propositional logic based
cryptosystem introduced by J. Kari [5].



1 Plasmids for Aqueous Computing 

DNA molecules occur in nature in both linear and circular form. Plasmids are 
small circular double-stranded DNA molecules. They carry adequate informa- 
tion encoded in their sequences necessary for their replication and this is used in 
genetic engineering. The technique used is cut and paste - cutting by restriction 
enzymes and pasting by a ligase. The sequential application of a set of
restriction enzymes acting at distinct, non-overlapping sites in circular DNA
molecules is fundamental to the procedure suggested below.

2 Divide-Delete-Drop (D-D-D) 

Tom Head [1] has given the following procedure, which provides the correct
YES/NO answer for instances of 3-SAT in a number of steps linearly bounded
by the sum of the number of atomic propositional variables and the number of
triples that are disjoined.

The fundamental data structure for the computational work involved in D- 
D-D is an artificial plasmid constructed as follows. 

For a specified set of restriction enzymes $\{RE_1, RE_2, \ldots, RE_n\}$, the
plasmid contains a segment of the form
$c_1 s_1 c_1\, c_2 s_2 c_2 \ldots c_i s_i c_i \ldots c_{n-1} s_{n-1} c_{n-1}\, c_n s_n c_n$;
the subsegments $c_1, c_2, \ldots, c_n$ are sites at which the enzymes
$RE_1, \ldots, RE_n$ can cut the plasmid. These are the only places where the
enzymes can cut the plasmid. The subsegments $s_1, s_2, \ldots, s_n$ are of
fixed length and are called stations. The sequence for each station is to be
chosen so that no other subsegment of the plasmid has the same base pair
sequence.

The key computational operation used in plasmids is the compound operation 
consisting of the removal of one station from a (circular) plasmid followed by 
recircularizing the linear molecule. This is a cut and paste process. The net 
effect of the compound operation is the removing of the station. The compound 
biochemical deletion process is Delete(Station-Name). Two further operations 
used are Drop and Divide. 

3 Procedure Cut-Expand-Ligate (C-E-L) to Solve
Satisfiability of Sets of Boolean Clauses

In Binghamton, Tom Head, S. Gal and M. Yamamura provided a prototype
solution of a three-variable, four-clause satisfiability (SAT) problem as an
aqueous computation.

The memory register molecule was chosen to be a commercially available
plasmid. Restriction enzyme sites were chosen in the Multiple Cloning Site
(MCS) of the plasmid, each serving as a station for the computation.

The computation begins with a test tube of water (or appropriate buffer)
that contains a vast number of identical, $n$-station plasmids. During a
computation, the plasmids are modified so as to be readable later.
Modifications take place only at the stations. Each station $s$, at any time
during the computation, is in one of two states, representing one of the bits
1 or 0. Thus, each plasmid plays a role corresponding to that of an $n$-bit
data register in a conventional computer.

The initial condition of a station (restriction enzyme site) represents the bit 1.
A zero is written at a station by altering the site making use of the following 
steps: 

1. Linearize the plasmid by cutting it at the station (site), with the restriction 
enzyme associated with it. 

2. Using a DNA polymerase, extend the 3' ends of the strands lying under the 
5' overhangs to produce a linear molecule having blunt ends. 

3. When a station is altered to represent the bit zero, the length of the station is 
increased and the station no longer encodes a site for the originally associated 
enzyme. 

3.1 Satisfiability of Sets of Boolean Clauses 

The SAT instance solved in the wet lab was to find a truth assignment for the
variables $p$, $q$, $r$ for which each of the four clauses

$p \vee q,\quad \neg p \vee q \vee \neg r,\quad \neg q \vee \neg r,\quad p \vee r$

evaluates to true.
A commercially available cloning plasmid was used as the memory register
molecule. Six restriction enzyme sites in the multiple cloning site (MCS) of the
plasmid were chosen to serve as stations for the computation. The first step is
to remove contradictions. At the end of this step, there are only plasmids that
represent logically consistent truth assignments for the variables and their
negations.

The next step involves elimination. Molecules that do not encode a satisfying
truth assignment for a clause are eliminated, one clause after the other.
After the elimination phase, the remaining plasmids are consistent with truth
settings for which all the given clauses have the value true.

4 Cut-Delete-Expand-Ligate (C-D-E-L) 

The features of Tom Head's Divide-Delete-Drop (D-D-D, [1]) and
Cut-Expand-Ligate (C-E-L, [2,3,4]) procedures are combined to form
Cut-Delete-Expand-Ligate (C-D-E-L). This enables us to get an aqueous solution
to #3SAT, which is a counting problem and is known to be in IP.

An algorithm for a counting problem takes as input an instance of a decision 
problem and produces as output a non-negative integer that is the number of 
solutions for that instance. 

In the 3SAT problem, we are required to determine whether a given 3CNF formula
is satisfiable. Here we are interested in a counting version of this problem
called #3SAT. Given a 3CNF formula $F$ and an integer $s$, we need to verify
that the number of distinct satisfying truth assignments for $F$ is $s$.
Equivalently, we have to find the number of truth assignments that satisfy $F$.

4.1 #3SAT 

Instance: A propositional formula of the form
$F = C_1 \wedge C_2 \wedge \ldots \wedge C_m$, where $C_i$,
$i = 1, 2, \ldots, m$, are clauses. Each $C_i$ is of the form
$(l_{i1} \vee l_{i2} \vee l_{i3})$, where $l_{ij}$, $j = 1, 2, 3$, are literals
for the set of variables $\{x_1, x_2, \ldots, x_n\}$.

Question: What is the number of truth assignments that satisfy $F$?
It is known that #3SAT is in IP and IP = PSPACE.
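
As a reference point for what the molecular procedure has to compute, the count can be obtained on a conventional computer by exhaustive enumeration. The sketch below is an illustration only: the encoding of a literal as a nonzero integer ($k$ for $x_k$, $-k$ for $\neg x_k$) and the name `count_satisfying` are assumptions made for this example.

```python
from itertools import product

def count_satisfying(n, clauses):
    """Count the truth assignments of x_1..x_n that satisfy every clause.

    A literal is a nonzero integer: k stands for x_k and -k for its negation.
    """
    count = 0
    for bits in product((False, True), repeat=n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            count += 1
    return count

# F = (x1 or x2 or x3) and (not x1 or not x2 or not x3)
print(count_satisfying(3, [(1, 2, 3), (-1, -2, -3)]))
```

The enumeration takes $2^n$ iterations, which is the work the aqueous procedure distributes over the plasmid population.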

Computation: As in C-E-L, only one variety of plasmid is used, but trillions 
of artificial plasmids containing multiple cloning sites (MCS) are provided for 
the initial solution. MCS consists of a number of sites identified by different 
restriction enzymes and determined by the number of variables. All bio-molecular 
operations are done on MCS. As in D-D-D, the segment of plasmid used is of 
the form 

$c_1 s_1 c_1 \ldots c_2 s_2 c_2 \ldots c_n s_n c_n,$

where $c_i$, $i = 1, \ldots, n$, are the sites, such that no other subsequence
of the plasmid matches with this sequence, and $s_i$, $i = 1, \ldots, n$, are
the stations.

In D-D-D, the lengths of the stations were required to be the same, while
one of the main differences in C-D-E-L is that the lengths of the stations are
all required to be different. This is fundamental to solving #3SAT.
Biomolecular operations used in our C-D-E-L procedure are similar to those used
in C-E-L.
Design. Let $x_1, \ldots, x_n$ be the variables in $F$,
$\neg x_1, \ldots, \neg x_n$ their negations, $c_i$ a site associated with
station $s_i$, $\neg c_i$ a site associated with station $\neg s_i$, $r_i$ the
length of the station associated with $x_i$, $i = 1, \ldots, n$, and $r_{n+j}$
the length of the station associated with literal $\neg x_j$,
$j = 1, \ldots, n$.

We choose stations in such a way that the sequence $[r_1, \ldots, r_{2n}]$
satisfies the property $\sum_{i=1}^{k} r_i < r_{k+1}$,
$k = 1, \ldots, 2n - 1$, i.e., it is a super-increasing Easy Knapsack sequence.

From the sum, the subsequence can be efficiently recovered.
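
The recovery of the subsequence from its sum is the standard greedy procedure for super-increasing sequences, sketched below. The function names and the sample lengths are illustrative assumptions; in the setting above the sum would correspond to the total length of the stations remaining in a plasmid.

```python
def is_superincreasing(r):
    """Check that every element exceeds the sum of all earlier elements."""
    total = 0
    for value in r:
        if value <= total:
            return False
        total += value
    return True

def recover_subset(r, total):
    """Return the 0-based indices of the elements of r that sum to total.

    Scans from the largest element down; returns None if total is not a
    subset sum of r.
    """
    chosen = []
    for i in range(len(r) - 1, -1, -1):
        if r[i] <= total:
            chosen.append(i)
            total -= r[i]
    return sorted(chosen) if total == 0 else None

lengths = [2, 3, 7, 15, 31, 60]           # hypothetical station lengths
print(is_superincreasing(lengths))
print(recover_subset(lengths, 2 + 7 + 31))
```

The greedy choice is forced: the largest element not exceeding the remaining sum must be included, because all smaller elements together sum to less than it.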



4.2 Operations in C-D-E-L to Solve #3SAT

The initial plasmid is represented by a binary string of length 2n, with all ones. 
Writing a zero instead of a one is done by deleting a station from the plasmid; in 
this way, the site is disabled so that the restriction enzyme cannot cut anymore. 
This is done in 4 steps as follows: 

1. Apply restriction enzyme R which corresponds to site c that flanks s. This 
produces two linear DNA strands, a short one containing station s and a 
portion of site c, and the other long, having sticky ends at both ends (the 
sticky ends are portions of c). 

2. Remove the short DNA strand containing the station s. 

3. Using DNA polymerase, extend the DNA below the sticky ends of the long 
strand to make blunt ends. 

4. Apply ligase to recircularise the linear molecules, ligating the blunt ends.

Step 1 of Procedure: Remove all contradictions. If the initial plasmid
contains stations associated with both $x_i$ and $\neg x_i$,
$i = 1, \ldots, n$, then these are contradictions. To remove them for a
variable $x_i$, we proceed as follows. Divide the solution into 2 test tubes
$T_1$ and $T_2$. In $T_1$, write zero for station $s_i$ (corresponding to
$x_i$). In $T_2$, write zero for station $\neg s_i$ (corresponding to
$\neg x_i$). Then $T_1$ and $T_2$ are mixed together. Do this for all the
variables in $F$ to remove all contradictions.

Step 2 of Procedure: Suppose $C_i = l_{i1} \lor l_{i2} \lor l_{i3}$ is the ith clause in F; $c_{ij}$
is the restriction enzyme site corresponding to the literal $l_{ij}$, $j = 1, 2, 3$. Divide
the solution into three test tubes L, M, R; in L apply the restriction enzyme for the site $\neg c_{i1}$
to linearise all strands that have 0 for $l_{i1}$, i.e., 1 for $\neg l_{i1}$. Then, remove all
linear strands from L. In M apply the restriction enzyme for $\neg c_{i2}$ and remove all linear
strands to eliminate all strands in M that do not have 1 for literal $l_{i2}$. In R do
this for $l_{i3}$. Pour L, M, and R into a new test tube, to get the resultant solution.
Starting with clause $C_1$, do this for all clauses $C_i$ by taking the resultant from
clause $C_{i-1}$, $i = 2, \dots, n$. The resultant solution in the test tube for clause $C_n$ encodes
all the satisfying assignments of F, if any. The resultant solution is analysed by
gel separation. If more than one satisfying assignment is present in the final
solution, the plasmids encoding the different assignments have different lengths.
Since they contain different sets of sequences they form distinct bands in the
gel platform. Any subsequence of an Easy Knapsack sequence has a different sum
from the sums of other subsequences. Since the sequences are chosen from Easy
Knapsack, the $r_i$ can be retrieved from the sum. From the sum, the subsequence
can be efficiently recovered.
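
The two steps of the procedure can be mimicked on paper with sets. The sketch below is only an abstraction under stated assumptions: a plasmid is modelled as the set of stations it still carries, the literal $x_i$ is encoded as the integer i and $\neg x_i$ as -i, and cutting by an enzyme is modelled as discarding every plasmid that still carries the corresponding station. None of the names come from the paper.

```python
def remove_contradictions(n):
    """Step 1: start from the all-ones plasmid; for each variable, split the
    pool in two tubes, delete one of the two stations in each, and remix."""
    full = frozenset(range(1, n + 1)) | frozenset(-i for i in range(1, n + 1))
    pool = {full}
    for i in range(1, n + 1):
        tube1 = {p - {i} for p in pool}    # write zero for x_i
        tube2 = {p - {-i} for p in pool}   # write zero for not x_i
        pool = tube1 | tube2
    return pool


def filter_clause(pool, clause):
    """Step 2 for one clause: one tube per literal; in each tube, plasmids
    that still carry the complementary station are linearised and removed."""
    merged = set()
    for literal in clause:
        merged |= {p for p in pool if -literal not in p}
    return merged


def solve(n, clauses):
    """Run Step 1, then Step 2 clause by clause on the previous resultant."""
    pool = remove_contradictions(n)
    for clause in clauses:
        pool = filter_clause(pool, clause)
    return pool


# F = (x1 or x2 or not x3) and (not x1 or not x2 or x3)
solutions = solve(3, [(1, 2, -3), (-1, -2, 3)])
```

Each surviving set names exactly one station per variable, so it can be read directly as a truth assignment.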

5 Breaking the Propositional Logic-Based Cryptosystem 

J. Kari [5] introduced a cryptosystem based on propositional logic. He proved
that the system was optimal in the sense that any cryptanalytic attack against
it can be used to break any other system as well. He also showed that the
problem of cryptanalytic attack encountered by the eavesdropper is NP-hard.

In [6] we had adapted the D-D-D procedure of T. Head [1] to give a method
of cryptanalytic attack against this cryptosystem. Now using the C-D-E-L
procedure and the method explained above to solve #3SAT, we note that a more
efficient method of cryptanalytic attack against the propositional
logic-based cryptosystem of Kari can be given. This is done by applying #3SAT
to $F = \neg p_0 \land p_1$, where $p_0$ and $p_1$ are the public keys.

The final solution will give all those truth assignments that make $p_0$ false and $p_1$
true. There are several advantages over the earlier method. In the cryptanalytic
attack proposed in [6] modifying D-D-D, it was required to execute the DNA
algorithm for each bit in the cryptotext, while in the present C-D-E-L method
(combining features of D-D-D and C-D-E-L) the final solution gives all the truth
assignments.



6 Conclusion 

In Adleman’s and Lipton’s methods, the construction of the exponential number 
of all potential solutions is considered at the initial step. In Tom Head’s proce- 
dure, this is replaced by initialization with one single circular molecular variety, 
which is seen to be capable of providing representation of all solutions as com- 
putation progresses. However, an exponential number of plasmids is necessary, 
and this limits the size of problems which can be handled in the laboratory. We 
hope to work on procedures which will overcome this difficulty. 
Abstract. We present a variant of communicating distributed H systems
where each filter is substituted by a tuple of filters. Such systems
behave like original ones with the difference that at each moment we use 
one element of the tuple for the filtering process and this element is re- 
placed after each use, periodically. We show that it suffices to use tuples 
of two filters in order to generate any recursively enumerable language, 
with two tubes only. We also show that it is possible to obtain the same 
result having no rules in the second tube which acts as a garbage collec- 
tor. Moreover, the two filters in a tuple differ only in one letter. We also 
present different improvements and open questions. 



1 Introduction 

Splicing systems (H systems) were the first theoretical model of biomolecular 
computing and they were introduced by T. Head [5,6]. The molecules from
biology are replaced by words over a finite alphabet and the chemical reactions 
are replaced by a splicing operation. An H system specifies a set of rules used to 
perform a splicing and a set of initial words or axioms. The computation is done 
by applying iteratively the rules to the set of words until no more new words 
can be generated. This corresponds to a bio-chemical experiment where we have 
enzymes (splicing rules) and initial molecules (axioms) which are put together 
in a tube and we wait until the reaction stops. 

Unfortunately, H systems are not very powerful and a lot of other models 
introducing additional control elements were proposed (see [14] for an overview) . 
One of these well-known models is communicating distributed H systems (CDH
systems) or test tube systems (TT systems) introduced in [1]. This model intro- 
duces tubes (or components). Each tube contains a set of strings and a set of 
splicing rules and evolves as an H system. The result of its work can be redis- 
tributed to other tubes according to a certain filter associated with each tube. 
Each filter is an alphabet and a string will enter a tube only if it is composed
only of symbols present in the filter associated with the tube. After that these
strings form the new starting molecules for the corresponding tube. 

In the same paper it was shown that the model can generate any recursively 
enumerable language using a finite set of strings and a finite set of splicing rules. 
The number of test tubes used in [1] to generate a language was dependent on
the language itself. It was also shown that only one tube generates the class of 
regular languages. 

A series of papers then showed how to create test tube systems able to 
generate any recursively enumerable language using fewer and fewer test tubes.
In [2] this result is obtained with 9 test tubes, in [11] with 6, and in [15] with 3 
tubes. 

In [1], it is shown that test tube systems with two test tubes can generate
non-regular languages, but the problem whether or not we can generate all
recursively enumerable languages with two test tubes is still open. Nevertheless,
in [3] R. and F. Freund showed that two test tubes are able to generate all
recursively enumerable languages if the filtering process is changed. In the variant
they propose, a filter is composed of a finite union of sets depending upon the
simulated system. In [4] P. Frisco and C. Zandron propose a variant which uses
two-symbol filters, each of which is a set of single symbols and couples of symbols. A
string can pass the filter if it is made from single symbols or it contains both
elements from a couple. In the case of this variant two test tubes suffice to
generate any recursively enumerable language.

In this paper we propose a new variant of test tube systems which differs from
the original definition by the filtering process. Each filter is replaced by a tuple
of filters, each of them being an alphabet (i.e., as in the original definition),
and the current filter i (the one which shall be used) is replaced by the next
element of the tuple (i + 1) after its usage. When the last filter in the tuple is
reached the first filter is taken again and so on.

We show that systems with two tubes are enough to generate any recursively 
enumerable language. We present different solutions depending on the number 
of elements in the filter tuple. In Section 3 we describe a system having tuples 
of two filters. Additionally both filters of the first tube coincide. In Section 4 we 
present a system having tuples of four filters, but which contains no rules in the 
second tube. In this case, all the work is done in the first tube and the second 
one acts like a garbage collector. In the same section it is shown how to reduce 
the number of filters to three. Finally, in Section 5 we present a system with two 
filters which contains no rules in the second tube and for which the two filters 
in a tuple differ only in one letter. 



2 Basic Definitions 

2.1 Grammars and Turing Machines 

For definitions on Chomsky grammars and Turing machines we shall refer to [7]. 

In what follows we shall fix the notations that we use. We consider
non-stationary deterministic Turing machines, i.e., at each step the head shall move
to the left or right. We denote such machines by $M = (Q, T, a_0, q_0, F, \delta)$, where Q
is the set of states and T is the tape alphabet, $a_0 \in T$ is the blank symbol, $q_0 \in Q$
is the initial state, $F \subseteq Q$ is the set of final (halting) states. By $\delta$ we denote the
set of instructions of machine M. Each instruction is represented in the following
way: $q_i a_k D a_l q_j$. The meaning is the following: if the head of machine M being
in state $q_i$ is scanning a cell which contains $a_k$ then the contents of the scanned
cell is replaced by $a_l$, the head moves to the left (if D = L) or to the right (if
D = R) and its state changes to $q_j$.

By a configuration of a Turing machine we shall understand the string $w_1 q w_2$,
where $w_1, w_2 \in T^*$ ($w_2 \neq \varepsilon$) and $q \in Q$. A configuration represents the contents
of non-empty cells of the working tape of the machine (i.e., all other cells to the
left and to the right are blank), its state and the position of the head on the
tape. The machine head is assumed to read the leftmost letter of $w_2$. Initially
all cells on the tape are blank except finitely many cells.
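
A single step on such a configuration can be sketched as follows. This is a minimal illustration, assuming one-character tape symbols, a blank written as "_", and a dictionary `delta` mapping (state, scanned symbol) to (direction, written symbol, new state); these representation choices are not part of the original definitions.

```python
BLANK = "_"  # assumed stand-in for the blank symbol a_0


def step(config, delta):
    """Apply one instruction q_i a_k D a_l q_j to a configuration (w1, q, w2).

    Returns the next configuration, or None if no instruction applies.
    The head reads the leftmost letter of w2, and w2 is kept non-empty.
    """
    left, state, right = config
    scanned, rest = right[0], right[1:]
    if (state, scanned) not in delta:
        return None
    direction, written, new_state = delta[(state, scanned)]
    if direction == "R":
        # Head moves right; pad with a blank if the tape word is exhausted.
        return (left + written, new_state, rest or BLANK)
    # Head moves left; pad with a blank if there is nothing on the left.
    if not left:
        left = BLANK
    return (left[:-1], new_state, left[-1] + written + rest)


# Hypothetical two-instruction machine used only to exercise the function.
delta = {("q0", "a"): ("R", "b", "q1"), ("q1", "a"): ("L", "a", "q2")}
c1 = step(("", "q0", "aa"), delta)
c2 = step(c1, delta)
```

The sketch does not trim blanks that accumulate at the ends of the tape word, which the string form $w_1 q w_2$ would normally omit.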



2.2 The Splicing Operation 



An (abstract) molecule is simply a word over some alphabet. A splicing rule (over
alphabet V) is a quadruple $(u_1, u_2, u'_1, u'_2)$ of words $u_1, u_2, u'_1, u'_2 \in V^*$, which is
often written in a two dimensional way as follows:

$\dfrac{u_1 \mid u_2}{u'_1 \mid u'_2}$.

We shall denote the empty word by $\varepsilon$.

A splicing rule $r = (u_1, u_2, u'_1, u'_2)$ is said to be applicable to two molecules
$m_1, m_2$ if there are words $w_1, w_2, w'_1, w'_2 \in V^*$ with $m_1 = w_1 u_1 u_2 w_2$ and
$m_2 = w'_1 u'_1 u'_2 w'_2$, and produces two new molecules $m'_1 = w_1 u_1 u'_2 w'_2$ and
$m'_2 = w'_1 u'_1 u_2 w_2$. In this case, we also write $(m_1, m_2) \vdash_r (m'_1, m'_2)$.
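
The splicing operation is easy to try out on ordinary strings. The following sketch assumes molecules are Python strings and a rule is a 4-tuple of strings; the helper names are illustrative only.

```python
def occurrences(word, site):
    """All start positions at which `site` occurs in `word`."""
    return [i for i in range(len(word) - len(site) + 1)
            if word.startswith(site, i)]


def splice(rule, m1, m2):
    """Return every pair (m1', m2') obtainable from (m1, m2) by `rule`.

    rule = (u1, u2, u1p, u2p); m1 is cut between u1 and u2,
    m2 is cut between u1p and u2p, and the right-hand parts are swapped.
    """
    u1, u2, u1p, u2p = rule
    site1, site2 = u1 + u2, u1p + u2p
    results = set()
    for i in occurrences(m1, site1):
        for j in occurrences(m2, site2):
            w1, w2 = m1[:i], m1[i + len(site1):]
            w1p, w2p = m2[:j], m2[j + len(site2):]
            results.add((w1 + u1 + u2p + w2p, w1p + u1p + u2 + w2))
    return results


# Example: cut "aabcc" between "ab" and "c", and "dxyzd" between "x" and "yz".
pairs = splice(("ab", "c", "x", "yz"), "aabcc", "dxyzd")
```

A rule may match at several positions of the same molecule, so the function returns a set of result pairs rather than a single pair.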

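A pair $h = (V, R)$, where V is an alphabet and R is a finite set of splicing
rules, is called a splicing scheme or an H scheme.

For an H scheme $h = (V, R)$ and a language $L \subseteq V^*$ we define:

$\sigma_h(L) \stackrel{\text{def}}{=} \{w, w' \in V^* \mid (w_1, w_2) \vdash_r (w, w') \text{ for some } w_1, w_2 \in L,\ r \in R\}$,

$\sigma_h^0(L) = L$,

$\sigma_h^{i+1}(L) = \sigma_h^i(L) \cup \sigma_h(\sigma_h^i(L))$, $i \ge 0$,

$\sigma_h^*(L) = \bigcup_{i \ge 0} \sigma_h^i(L)$.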



A Head splicing system [5,6], or H system, is a construct $H = (h, A) =
((V, R), A)$ of an alphabet V, a set $A \subseteq V^*$ of initial molecules over V, the
axioms, and a set $R \subseteq V^* \times V^* \times V^* \times V^*$ of splicing rules. H is called finite if
A and R are finite sets.

The language generated by H system H is:

$L(H) \stackrel{\text{def}}{=} \sigma_h^*(A)$.

Thus, the language generated by H system H is the set of all molecules that 
can be generated in H starting with A as initial molecules by iteratively applying 
splicing rules to copies of the molecules already generated. 
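
Iterated splicing can be sketched as a fixed-point computation. Since $\sigma_h^*(A)$ may be infinite, the sketch below stops after an assumed bound `max_rounds`; the function names, the bound and the sample rule are illustrative and not part of the definition.

```python
def splice(rule, m1, m2):
    """All molecules produced by one application of `rule` to (m1, m2)."""
    u1, u2, v1, v2 = rule
    s1, s2 = u1 + u2, v1 + v2
    out = set()
    for i in range(len(m1) - len(s1) + 1):
        if not m1.startswith(s1, i):
            continue
        for j in range(len(m2) - len(s2) + 1):
            if m2.startswith(s2, j):
                out.add(m1[:i] + u1 + v2 + m2[j + len(s2):])
                out.add(m2[:j] + v1 + u2 + m1[i + len(s1):])
    return out


def sigma(rules, language):
    """One splicing round: every rule on every ordered pair of molecules."""
    result = set()
    for rule in rules:
        for m1 in language:
            for m2 in language:
                result |= splice(rule, m1, m2)
    return result


def sigma_star(rules, axioms, max_rounds=10):
    """Iterate sigma until nothing new appears or the round bound is hit."""
    current = set(axioms)
    for _ in range(max_rounds):
        extended = current | sigma(rules, current)
        if extended == current:
            break
        current = extended
    return current


language = sigma_star([("a", "b", "c", "d")], {"ab", "cd"})
```

In this small example the iteration stabilises after one round, because the new molecules no longer contain a site of the rule.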
2.3 Distributed Splicing Systems

A communicating distributed H system (CDH system) or a test tube system (TT
system) [1] is a construct

$\Delta = (V, T, (A_1, R_1, F_1), \dots, (A_n, R_n, F_n))$,

where V is an alphabet, $T \subseteq V$ (terminal alphabet), $A_i$ are finite languages over
V (axioms), $R_i$ are finite sets of splicing rules over V, and $F_i \subseteq V$ are finite sets
called filters or selectors, $1 \le i \le n$.

Each triple $(A_i, R_i, F_i)$, $1 \le i \le n$, is called a component (or tube) of $\Delta$.

An n-tuple $(L_1, \dots, L_n)$, $L_i \subseteq V^*$, $1 \le i \le n$, is called a configuration of the
system; $L_i$ is also called the contents of the ith component. For two configurations
$(L_1, \dots, L_n)$ and $(L'_1, \dots, L'_n)$ we define:

$(L_1, \dots, L_n) \Rightarrow (L'_1, \dots, L'_n)$ iff

$L'_i = \bigcup_{j=1}^{n} \left(\sigma_j^*(L_j) \cap F_i^*\right) \cup \left(\sigma_i^*(L_i) \cap \left(V^* - \bigcup_{l=1}^{n} F_l^*\right)\right)$, $1 \le i \le n$,

where $\sigma_i = (V, R_i)$ is the H scheme associated with the component i of the
system.

In words, the contents of each component is spliced according to the associated
set of rules like in the case of an H system (splicing step) and the result
is redistributed among the n components according to filters $F_i$; the part which
cannot be redistributed remains in the component (communication step). If a
string can be distributed over several components then each of them receives a
copy of the string.

We shall consider all computations between two communications as a macro-step,
so macro-steps are separated by communication.

The language generated by $\Delta$ is:

$L(\Delta) = \{w \in T^* \mid w \in L_1,\ \exists L_1, \dots, L_n \subseteq V^* : (A_1, \dots, A_n) \Rightarrow^* (L_1, \dots, L_n)\}$.
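
The communication step alone can be sketched independently of splicing. The sketch assumes that each tube's contents after the splicing step is given as a finite set of strings over one-character symbols and that each filter is a set of characters; the names are illustrative.

```python
def passes(word, filt):
    """A word passes a filter if all of its symbols belong to the filter."""
    return all(symbol in filt for symbol in word)


def communicate(contents, filters):
    """Redistribute the spliced contents of all tubes.

    Tube i receives a copy of every word (from any tube) that passes F_i.
    A word that passes no filter at all stays in its tube of origin.
    """
    n = len(contents)
    new_contents = [set() for _ in range(n)]
    for origin, tube in enumerate(contents):
        for word in tube:
            targets = [i for i in range(n) if passes(word, filters[i])]
            if targets:
                for i in targets:
                    new_contents[i].add(word)
            else:
                new_contents[origin].add(word)
    return new_contents


after = communicate([{"ab", "cz"}, {"cd"}], [{"c", "d"}, {"a", "b"}])
```

In the example, "ab" moves to the second tube, "cd" moves to the first, and "cz" stays where it is because the symbol z occurs in no filter.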

A communicating distributed H system with m alternating filters (CDHF system
or TTF system) is a construct

$\Gamma = (V, T, (A_1, R_1, F_1^{(1)}, \dots, F_1^{(m)}), \dots, (A_n, R_n, F_n^{(1)}, \dots, F_n^{(m)}))$,

where V, T, $A_i$ and $R_i$ are defined as in a communicating distributed H system.
Now instead of one filter $F_i$ for each component i we have an m-tuple of filters
$F_i^{(r)}$, $1 \le r \le m$, where each filter $F_i^{(r)}$ is defined as in the above case. At
each macro-step $k \ge 1$ only filter $F_i^{(r)}$, $r = (k - 1) \pmod{m} + 1$, is active for
component i.

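We define:

$(L_1^{(1)}, \dots, L_n^{(1)}) = (A_1, \dots, A_n)$,

$L_i^{(k+1)} = \bigcup_{j=1}^{n} \left(\sigma_j^*(L_j^{(k)}) \cap (F_i^{(r)})^*\right) \cup \left(\sigma_i^*(L_i^{(k)}) \cap \left(V^* - \bigcup_{l=1}^{n} (F_l^{(r)})^*\right)\right)$, $k \ge 1$,

with $r = (k - 1) \pmod{m} + 1$, where $\sigma_i = (V, R_i)$ is the H scheme associated with component i of the system.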
This is very similar to the previous system. The only difference is that instead
of one stationary filter $F_i$ there is a tuple of filters $F_i^{(1)}, \dots, F_i^{(m)}$ and the filter
to be used depends on the number of communication steps being done.

The language generated by $\Gamma$ is:

$L(\Gamma) = \{w \in T^* \mid w \in L_1^{(k)} \text{ for some } k \ge 1\}$.

We denote by $CDH_n$ (or $TT_n$) the family of communicating distributed H
systems having at most n tubes and by $CDHF_{n,m}$ (or $TTF_{n,m}$) the family of
communicating distributed H systems with alternating filters having at most n
tubes and m filters.
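
The periodic choice of the active filter is a one-line computation. The sketch below assumes each component's filters are given as a tuple of character sets of equal length m; the function names and the sample filters are hypothetical.

```python
def active_filter_index(k, m):
    """Index r (1-based) of the filter active at macro-step k >= 1."""
    return (k - 1) % m + 1


def active_filters(filter_tuples, k):
    """Pick, for every component, the filter that is active at macro-step k."""
    m = len(filter_tuples[0])
    r = active_filter_index(k, m)
    return [filters[r - 1] for filters in filter_tuples]


# Two components with two filters each; the first component's filters coincide.
filter_tuples = [({"a"}, {"a"}), ({"a", "b"}, {"a", "c"})]
schedule = [active_filter_index(k, 2) for k in range(1, 6)]
```

With m = 2 the first filter is used at odd macro-steps and the second at even ones, which is the parity argument used repeatedly in the proofs that follow.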

3 TTF2,2 Are Computationally Universal 

In this section we present a CDHF system with two components and two filters 
which simulates a type-0 grammar. Moreover, both filters in the first component 
are identical. 

Theorem 1. For any type-0 grammar $G = (N, T, P, S)$ there is a communicating
distributed H system with alternating filters having two components and two
filters $\Gamma = (V, T, (A_1, R_1, F_1^{(1)}, F_1^{(2)}), (A_2, R_2, F_2^{(1)}, F_2^{(2)}))$ which simulates G and
$L(\Gamma) = L(G)$. Additionally $F_1^{(1)} = F_1^{(2)}$.

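We construct $\Gamma$ as follows. Let

$N \cup T \cup \{B\} = \{a_1, a_2, \dots, a_n\}$

(where we assume $B = a_n$).

In what follows we will assume the following:

$1 \le i \le n$, $a \in N \cup T \cup \{B\}$, $\gamma \in \{\alpha, \beta\}$,
$b \in N \cup T \cup \{B\} \cup \{\circ\}$,

$V = N \cup T \cup \{B\} \cup \{\alpha, \beta\} \cup \{\circ\}
\cup \{X, Y, X_\alpha, X_\beta, X'_\alpha, Y_\alpha, Y'_\alpha, Y_\beta, Z, Z'\}$.

The terminal alphabet T is the same as for the grammar G.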

Test tube I: 

Rules of $R_1$:
Filter:

= f[^'^ = iV U T U {5} U {a, /?} U {o} U {X, F, F„}. 
Axioms (Ai): 

{XBSY, Z(3a^Y^, Z o Y/s, X^PZ, X^PpZ, ZYp, X^Z, XppZ, ZY^} 
^{ZvY \ u ^ V G P}. 

Test tube II: 

Rules of $R_2$:

[Splicing rules 2.1-2.6, displayed in two-dimensional form; rule bodies illegible.]

Filter: 

^2^^^ = iV U T U {R} U {a, /?} U {o} U {A„, Y^}, 
= NUTU{B}U{a,P}U {o} U {Xp, F^}. 
Axioms (A2): 

{FF„,A>Z, ZY,Xa^Z, Z’}. 



Ideas of the Proof 



The “Rotate-and-Simulate” Method. The system uses the method of words'
rotation [12,13,14]. Let us recall it briefly. For any word $w = w'w'' \in (N \cup T)^*$
of the grammar G the word $Xw''Bw'Y$ ($X, Y, B \notin N \cup T$) of CDHF system
$\Gamma$ is called a “rotational version” of the word w. The system $\Gamma$ models the
grammar G as follows. It rotates the word $Xw_1uw_2BY$ into $Xw_2Bw_1uY$ and
applies a splicing rule

$\dfrac{\varepsilon \mid uY}{Z \mid vY}$.

So we model the application of a rule $u \to v$
of the grammar G by a single scheme-rule. $\Gamma$ rotates the word $Xwa_iY$ ($a_i \in
N \cup T \cup \{B\}$) “symbol by symbol”, i.e., the word $Xa_iwY$ will be obtained after
some steps of working of system $\Gamma$.

The rotation technique is implemented as follows [15,13,14]. Suppose that we
have $Xwa_iY$ and we rotate $a_i$, i.e., we shall obtain $Xa_iwY$ after some steps of
computation. We perform this rotation in three stages. First we encode $a_i$ by
$\beta\alpha^i$ and we get $X_\alpha\beta w\beta\alpha^{i-1}Y'_\alpha$. After that we transfer $\alpha$'s from the right end of
the molecule to the left end obtaining $X_\beta\beta\alpha^i\beta wY_\beta$. Finally we decode $\beta\alpha^i\beta$ into
$a_i$ which gives us $Xa_iwY$. More precisely, we start with the word $Xwa_iY$ in the
first tube. The second tube receives the word $X_\alpha\beta w\beta\alpha^{i-1}Y'_\alpha$ from the first one.
After that point the system works in a cycle where $\alpha$'s are inserted at the left
end of the molecule and simultaneously deleted at the right end. The cycle stops
when there is no $\alpha$ at the right end. After that we obtain the word $X_\beta\beta\alpha^i\beta wY_\beta$
in the second tube and we decode $a_i$ obtaining $Xa_iwY$.

If we have XwBY, then we take off X,Y and B obtaining w as a possible 
result. 
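
At the level of plain strings, the rotate-and-simulate idea can be sketched as below. This only illustrates the net effect of the rotation and of the production rule, not the splicing rules or the two-tube mechanics; it assumes one-character symbols with "X", "Y" and "B" as end markers and block marker, and all function names are hypothetical.

```python
def rotational_versions(w):
    """All words X w'' B w' Y with w = w' w''."""
    return ["X" + w[k:] + "B" + w[:k] + "Y" for k in range(len(w) + 1)]


def rotate_last_symbol(molecule):
    """Net effect of one rotation: X w a Y becomes X a w Y."""
    body = molecule[1:-1]
    return "X" + body[-1] + body[:-1] + "Y"


def apply_production(molecule, u, v):
    """Net effect of the rule (eps | uY) / (Z | vY): rewrite a suffix u to v."""
    if molecule.endswith(u + "Y"):
        return molecule[:-(len(u) + 1)] + v + "Y"
    return None


start = "XabcBY"
rotated_once = rotate_last_symbol(start)              # moves B to the front
rewritten = apply_production(rotated_once, "bc", "d")
```

Since a production can only be applied at the right end, the left-hand side u of the production has to be rotated into position before `apply_production` can succeed, which is exactly why rotation is needed.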
Proof

Our proof is done in the following way. We present how we can simulate the
behaviour of grammar G by $\Gamma$. At the same time we consider all other possible
evolutions of molecules and we show that they do not lead to a valid result.



Checking the Simulation. We note that the molecules from the bottom part
of each rule are already present in the corresponding tube. Moreover, one of the
results of the splicing will contain Z and will not contribute to obtaining the
result. So we will omit these molecules and we will write:

$Xwa_iY \xrightarrow{1.2} Xw\beta\alpha^{i-1}Y'_\alpha$ instead of

$(Xw \mid a_iY,\ Z \mid \beta\alpha^{i-1}Y'_\alpha) \vdash_{1.2} (Xw\beta\alpha^{i-1}Y'_\alpha,\ Za_iY)$,

where by | we highlighted the splicing sites. During the presentation we mark
molecules which can evolve in the same tube with h and we mark molecules
which must communicate to another tube with 1 or 2 (depending on tube). We
also write $m\uparrow$ if molecule m cannot enter any rule and cannot be communicated
to another tube.

We shall present the evolution of words of the form XwaiY. We shall describe 
splicing rules which are used and the resulting molecules. We note that, due to 
parallelism, at the same time in the system there can be several molecules of 
this form or intermediate forms which evolve in parallel. These molecules do not 
interact with each other, so we shall concentrate only on a single evolution. 



Rotation We shall give the evolution of word XwaiY which is in the first tube. 
Step 1 . 

Tube I. 

XwUiYh Xw/3a^”^Y(jh XQ,/3w/3a^“^Y(j2. 

We can use these two rules in opposite order which gives the same result. We 
can also use rule 1.5 instead of the last application. This gives us: 

XwPa'^~^Y^h. "f. 

We can also have: 

XwaiYh X^/3/3wYh. 

Molecule Xa[3wf3a^~^Y^ goes to the second tube because it can pass the 
filter and all other molecules cannot pass any of filters of the second tube. 

As we noted before there are other applications of splicing rules on other 
molecules but they are not relevant for our evolution. We shall not mention this 
fact in the future. 

Tube II. 

As we noted before there are other applications of splicing rules on other 
molecules but they are not relevant for our evolution. We shall not mention this 
fact in the future. 
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Step 2. 

Tube II. 

Xal3wl3a^~'^Yl^h Xa/3w/3o;^“^YQh X^a/3w/3o;^“^YQ.l. 

Only molecule X'^af3wf3a'‘~^Ya can pass filter Fi. 

In this way we transferred one a from the right end to the left end. 

Step 3. 

Tube I. 

X'^afiw j3a''~^Ya'h. X^a/3w/3a^“^Y^h XaQ;/3w/3a^“^Y^2. 

We can also have: 

X'^aPwPa'‘~^Ya'h. X^/3o;/3w/3a^”^YQh X^/3a/3w/3o;^“^Y^ "f. 

Only Xaa/3w/3a^~^Y^ can pass filter ^ 2 ^^ 

Step 4. 

Tube II. 

XaaPwPa’^~^Y^h. Xaaf3'wPa^~'^Ya'h. X^aa/3w/3a^“^Yo,l. 

Only molecule X'^aaf3w[3d^~‘^Ya can pass filter Fi. 

In this way we transferred one more a from the right end to the left end. We 
can continue in this way until we obtain for the first time X'^a^ f3wf3Ya in tube 2 
at some step p. This molecule is communicated to tube 1 where we can use the 
rule 1.6 and we complete the rotation of a,. 

Step p+1 . 

Tube I. 

X'^a^PwPYaii > X^a^/3wY^h > Xi^Pa^ (3'wYis2. 

1.6 1.8 

We can also have: 

X'^a'^PwPYa'b. > XaU^ P ti} (3Y'^h > XQ,a^/3wYfl "f. 

1.7 1.6 

(2) 

Only XpPa’^P'wYp can pass filter F 2 . We also add that p+1 is an odd num- 
ber, so molecule XpPa^ PwYp will remain for one step in the first component 
because the filter used in the second component during odd steps is ^ 2 ^^ Dur- 
ing next step (p+2) this molecule will not change and it will be communicated 

f2) 

to the second tube because the filter to be used will be F 2 ■ 

Step p+3. 

Tube II. 

X/jPa"^ PwYfjh — > XaiwY^h > XaiwYl. 

XttiwY goes to the first tube. 

In this way we completed the rotation of $a_i$.



Simulation of Grammar Productions. If we have a molecule XwuY in the
first tube and there is a rule $u \to v \in P$, then we can apply rule 1.1 to simulate
the corresponding production of the grammar.



Obtaining Results. Now we shall show how we can obtain a result. If we have
a molecule of type XwBY, then we can perform a rotation of B as described
above. We can also take off X and Y from the ends. In this way, if $w \in T^*$, then
w will be a result. So, now we shall show how to do this. Suppose that XwBY
appears for the first time in tube 1 at some step q (it is easy to check that q is
even).

Step q. 

Tube 1. 

XwBYh XwoY^h X^/3/3woY^2. 

We can also have: 

Xwo Ygh — > Xq/3wo Y /3 t . 

1.4 

XpPPw o Yi 3 can pass filter . 

Step q+1. 

Tube II. 

XaBBw o Ygh — > w o Y/jh — > wl. 

2.5 2.6 

So, if $w \in T^*$ then w is a result.



Final Remarks. We start with molecule XBSY in the first tube. After that
we are doing a rotation and simulation of productions of G. Finally we reach a
configuration XwBY where $w \in T^*$. At this moment we eliminate X, Y and B
obtaining $w \in T^*$.

As all cases have been investigated, the proof is complete. 

4 TTF2,4 with Garbage Collection 

In this section we shall present a TTF system having two components and four 
filters. Additionally this system contains no rules in the second component, so 
it is used only as a garbage collector. Also, all filters in the second tube are the 
same. 

Theorem 2. For any type-0 grammar $G = (N, T, P, S)$ there is a communicating
distributed H system with alternating filters having two components and four
filters $\Gamma = (V, T, (A_1, R_1, F_1^{(1)}, \dots, F_1^{(4)}), (A_2, R_2, F_2^{(1)}, \dots, F_2^{(4)}))$ which simulates
G and $L(\Gamma) = L(G)$. Additionally, the second component of $\Gamma$ has no
splicing rules, i.e., $R_2 = \emptyset$, and also $F_2^{(1)} = \dots = F_2^{(4)}$.

We construct F as follows. 

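Let $N \cup T \cup \{B\} = \{a_1, a_2, \dots, a_n\}$ ($B = a_n$).

In what follows we will assume the following:

$1 \le i \le n$, $1 \le j \le 4$, $a \in N \cup T \cup \{B\}$,
$b \in N \cup T \cup \{B\} \cup \{\circ\}$, $\gamma \in \{\alpha, \beta\}$.

Also let $V' = N \cup T \cup \{B\} \cup \{\alpha, \beta\} \cup \{\circ\}$.

$V = V' \cup \{X, Y, X_\alpha, X_\beta, X'_\alpha, Y_\alpha, Y'_\alpha, Y_\beta, Z, Z', Z_j, R_j, L_j\}$.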

The terminal alphabet T is the same as for grammar G. 
Test tube I:

Rules of $R_1$:

[Splicing rules 1.1.1-1.1.5, 1.2.1-1.2.2, 1.3.1-1.3.4, 1.4.1-1.4.4 and the primed rules 1.1.1'-1.4.3', displayed in two-dimensional form; rule bodies illegible.]

Filter: 

F'W = VU{X„,y^R2,i2}, 

=V\J{X p,Y 0 ,R^, Li}, 
f[^'^ =V\J{X,Y,Ri,Li}. 
Axioms ($A_1$):

{XBSY, Xa,f3Zi,Xf3f3pZi,Zipa^Y^,ZoYfi, A>Z2, 
XfipZs, ZsY^, ZYfi, ZiY, Xa^Zi, Z'Z^, RiLi} 
U{ZvY I M ^ u G P}. 

Test tube II: 

No rules ($R_2 = \emptyset$).

Filter: 

=VU{X, F, A4, F„, F^ F^, A, L,}. 

Axioms (A 2 ): 

{R2L2, R3L3, R4L4} 



Ideas of the Proof 

The system is similar to the system from the previous section. It contains the
same rules which are grouped in the first tube. The rules are split into four
subsets (1.1.x-1.4.x) which cannot be used all at the same time. In order to
achieve this we use the method described in [16,9]. We use molecules $R_jL_j$
which travel from one tube to another. When a molecule $R_jL_j$ appears in the
first tube it triggers subset j of rules which can be applied (in [16] these molecules
are called directing molecules). More precisely, the bottom part of each rule is
made in a special way: the molecule which is present there can be produced only
if we have the corresponding $R_jL_j$ in the first tube. For example, rule 1.1.3 can
be used only if molecule $X_\beta\beta\beta L_1$ is present. But this molecule can be present
only if $R_1L_1$ is present in the first tube (it is produced by applying rule 1.1.3'
on the axiom $X_\beta\beta\beta Z_1$ and $R_1L_1$, and on the next step it is communicated to
the second tube).

This mechanism can be realized by using alternating filters. The filters act 
also like selectors for correct molecules and the computation goes in the following 
way: molecules are sent into the second tube and after that the correct molecules 
are extracted from this tube into the first one where they are processed. 

We also note that the operations performed in the first group and the third 
group are independent and the molecules are of different forms. So, we can join 
the first group with the third group (the corresponding filters 2 and 4 as well) 
and we obtain: 

Theorem 3. For any type-0 grammar $G = (N, T, P, S)$ there is a communicating
distributed H system with alternating filters having two components and three
filters $\Gamma = (V, T, (A_1, R_1, F_1^{(1)}, \dots, F_1^{(3)}), (A_2, R_2, F_2^{(1)}, \dots, F_2^{(3)}))$ which simulates
G and $L(\Gamma) = L(G)$. Additionally, the second component of $\Gamma$ has no
splicing rules, i.e., $R_2 = \emptyset$, and also $F_2^{(1)} = \dots = F_2^{(3)}$.

5 TTF2,2 with Garbage Collection 

In this section we shall present a TTF system with two components and two 
filters which has no rules in the second component. Moreover, the two filters for 
both components differ only in one letter. Also the proof is different: we shall
use a simulation of a Turing machine instead of a Chomsky grammar.

We define a coding function $\varphi$ as follows. For any configuration $w_1qw_2$ of a
Turing machine M we define $\varphi(w_1qw_2) = Xw_1Sqw_2Y$, where X, Y and S are
three symbols which are not in T.

We also define the notion of an input for a TTF system. An input word
for a system $\Gamma$ is simply a word w over the non-terminal alphabet of $\Gamma$. The
computation of $\Gamma$ on input w is obtained by adding w to the axioms of the first
tube and after that by letting $\Gamma$ evolve as usual.

Lemma 1. For any Turing machine $M = (Q, T, a_0, q_0, F, \delta)$ and for any input
word w there is a communicating distributed H system with alternating
filters having two components and two filters, $\Gamma = (V, T_\Gamma, (A_1, R_1, F_1^{(1)}, F_1^{(2)}),
(A_2, R_2, F_2^{(1)}, F_2^{(2)}))$ which given the input $\varphi(w)$ simulates M on input w, i.e.,
such that:

1. for any word $w \in L(M)$ that reaches a halting configuration $w_1qw_2$, $\Gamma$ will
produce a unique result $\varphi(w_1qw_2)$;

2. for any word $w \notin L(M)$, $\Gamma$ will produce the empty language.

Let $T = \{a_0, \dots, a_{m-1}\}$, $Q = \{q_0, \dots, q_{n-1}\}$, $a \in T \cup \{X\}$, $b, d, e \in T$,
$c \in T \cup \{Y\}$, $q \in Q$, $a_0$ the blank symbol.

We construct $\Gamma$ as follows:

$V = T \cup Q \cup \{X, Y, S, S', R_Y, R, L, R', L', R_1^R, Z_1^R, Z_X, R_1^L, Z_1^L, R'_1, Z'_1\}$.

$T_\Gamma = T \cup \{X, Y, S\} \cup \{q \mid q \in F\}$.

Test tube I: 

Rules of i?i: 
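Filter:

$F_1^{(1)} = \{R', L, L'\}$, $F_1^{(2)} = \{R, L, L'\} = (F_1^{(1)} \setminus \{R'\}) \cup \{R\}$.

Axioms ($A_1$):

$XSq_0wY$, $R_1^R a_l S'q_j Z_1^R$ ($\exists\, q_i a_k R a_l q_j \in \delta$), $R_1^L S'q_j b a_l Z_1^L$ ($\exists\, q_i a_k L a_l q_j \in \delta$),
$R_Y a_0 Y$, $X a_0 Z_X$, $R'_1 S Z'_1$, $RL$.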



Test tube II: No rules ($R_2$ is empty).

Filter:

$F_2^{(1)} = Q \cup T \cup \{X, Y, S, R, L, R', L', R_1^R, R_1^L, R'_1\}$,

$F_2^{(2)} = Q \cup T \cup \{X, Y, S', R, L, R', L', R_1^R, R_1^L, R'_1\} = (F_2^{(1)} - \{S\}) \cup \{S'\}$.
Axioms ($A_2$): $R'L'$.



Ideas of the Proof 

The tape of M is encoded in the following way: for any configuration $w_1qw_2$ of
M we have a configuration $Xw_1Sqw_2Y$ in the system $\Gamma$. So, the tape is enclosed
by X and Y and we have a marker $Sq_i$ which marks the head position and the
current state on the tape.

$\Gamma$ simulates each step of M in 2 stages. During the first stage it simulates a
step of the Turing machine and it primes symbol S using rules 1.1.x.2-1.1.x.4.
During the second stage it unprimes S' using rules 1.2.x. The tape can be
extended if necessary by rules 1.1.x.1. The system produces a result only if M halts
on the initial string w.

In order to achieve the separation in 2 stages we use two subsets of rules in the 
first tube and the directing molecules RL and R' L' which appear alternatively 
in the first and the second test tube will trigger the first (1.1.x) or the second 
(1.2.x) subset (see also section 4 and [16]). All extra molecules are passed to the 
second test tube which is considered as a garbage collector because only RL and 
R'L' may go from that tube to the first one. 

Molecules $R_Y a_0 Y$, $R_1^R a_l S'q_j Z_1^R$, $X a_0 Z_X$, $R_1^L S'q_j b a_l Z_1^L$, $R'_1 S Z'_1$ are present
in test tube 1 all the time and they do not leave it. Also all molecules having
Z with indices at the end remain in the first test tube.

We start with $Xw_1Sq_0w_2Y$ in the first test tube where $w_1q_0w_2$ is the initial
configuration of M.



Checking the Simulation 

We completely describe the simulation of the application of a rule with a move
to right ($q_i a_k R a_l q_j \in \delta$) to configuration $Xw_1Sq_ia_kw_2Y$. Now also we shall omit
molecules $R_Y a_0 Y$, $R_1^R a_l S'q_j Z_1^R$, $X a_0 Z_X$, $R_1^L S'q_j b a_l Z_1^L$, $R'_1 S Z'_1$ which are always
present in the first tube. Also the molecules having Z with indices at the end
and which do not alter the computation will be omitted from the description
after one step of computation.
We also note that the second tube may contain some other molecules
(obtained during the previous steps) as it is a garbage collector. We shall omit these
molecules as they do not alter our simulation.

Step 1 . 

Splicing: 

Tube 1: 

We have: $Xw_1Sq_ia_kw_2Y$, $RL$.

We apply the following rules which are all from group 1.1.1.x:
$(Xw_1 \mid Sq_ia_kw_2Y,\ R \mid L) \vdash_{1.1.1.2} (Xw_1L,\ RSq_ia_kw_2Y)$.

$(RSq_ia_k \mid w_2Y,\ R_1^R a_l S'q_j \mid Z_1^R) \vdash_{1.1.1.3} (R_1^R a_l S'q_jw_2Y,\ RSq_ia_kZ_1^R)$.

$(Xw_1 \mid L,\ R_1^R \mid a_l S'q_jw_2Y) \vdash_{1.1.1.4} (Xw_1a_lS'q_jw_2Y,\ R_1^RL)$.

Tube 2: 

We have: R'L'. As it was said above we can also have a lot of other molecules 
here which do not alter the computation. We omit them and we do not mention 
this in the future. 

No application of rules. 

Communication: 

The current filter is $F_2^{(1)} = Q \cup T \cup \{X, Y, S, R, L, R', L', R_1^R, R_1^L, R'_1\}$.

Molecules going from tube 1 to tube 2:

$RL$, $Xw_1Sq_ia_kw_2Y$, $Xw_1L$, $RSq_ia_kw_2Y$, $R_1^RL$.

Molecules remaining in tube 1:

$R_1^R a_l S'q_jw_2Y$, $RSq_ia_kZ_1^R$, $Xw_1a_lS'q_jw_2Y$.

Molecules going from tube 2 to tube 1: 

R'L'. 

Molecules remaining in tube 2: 
none. 

Molecule $RSq_ia_kZ_1^R$ cannot evolve any further so we shall omit it from the
checking. Also molecule $R_1^R a_l S'q_jw_2Y$ will go from the first test tube to the
second tube during the next iteration.

Step 2.

Splicing:

Tube 1:

We have: $R_1^R a_l S'q_jw_2Y$, $Xw_1a_lS'q_jw_2Y$, $R'L'$.

We apply the following rules which are all from group 1.2.x:

$(Xw_1a_l \mid S'q_jw_2Y,\ R' \mid L') \vdash_{1.2.1} (Xw_1a_lL',\ R'S'q_jw_2Y)$.

$(R'S' \mid q_jw_2Y,\ R'_1S \mid Z'_1) \vdash_{1.2.2} (R'_1Sq_jw_2Y,\ R'S'Z'_1)$.

$(Xw_1a_l \mid L',\ R'_1 \mid Sq_jw_2Y) \vdash_{1.2.3} (Xw_1a_lSq_jw_2Y,\ R'_1L')$.

Tube 2:

We have: $Xw_1Sq_ia_kw_2Y$, $RL$, $Xw_1L$, $RSq_ia_kw_2Y$, $R_1^RL$.

No application of rules.

Communication:

The current filter is $F_2^{(2)} = Q \cup T \cup \{X, Y, S', R, L, R', L', R_1^R, R_1^L, R'_1\}$.
Molecules going from tube 1 to tube 2:

$R'L'$, $R_1^R a_l S'q_jw_2Y$, $Xw_1a_lS'q_jw_2Y$, $Xw_1a_lL'$, $R'S'q_jw_2Y$, $R'_1L'$.
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Molecules remaining in tube 1: 

R'S'Zi, R[SqjW2Y, XwiaiSqjW2Y. 

Molecules going from tube 2 to tube 1: 

RL. 

Molecules remaining in tube 2: 

$R'L'$, $R_1^Ra_lS'q_jw_2Y$, $Xw_1a_lS'q_jw_2Y$, $Xw_1a_lL'$, $R'S'q_jw_2Y$, $R'_1L'$,
$Xw_1Sq_ia_kw_2Y$, $Xw_1L$, $RSq_ia_kw_2Y$, $R_1^RL$.

On the next iteration $R'_1Sq_jw_2Y$ will go from the first test tube to the second
tube.

Thus we have modelled the application of the rule $(q_i, a_k, R, a_l, q_j)$ of the Turing
machine $M$. It is easy to observe that if $w_2 = \varepsilon$ we first apply rule 1.1.1.1 in
order to extend the tape to the right, after which $w_2 = a_0$. It is also easy
to check that the additional molecules obtained (such as $Xw_1Sq_ia_kw_2Y$, $Xw_1L$,
$RSq_ia_kw_2Y$, $R_1^RL$, $R_1^Ra_lS'q_jw_2Y$, $RSq_ia_kZ_1^R$, $Xw_1a_lS'q_jw_2Y$, $R'S'q_jw_2Y$,
$R'S'Z'_1$, $R'_1Sq_jw_2Y$, $Xw_1a_lL'$) either go to the second test tube
or do not alter the computation.
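
The net effect of the two steps on the encoded configuration can be summarised in a few lines of Python. This is only a sketch of the Turing machine move that the splicing rules simulate, not of the H system itself. The function name `apply_right_move`, the list encoding of molecules and the symbol name `a0` for the blank are assumptions.

```python
def apply_right_move(molecule, rule, blank="a0"):
    """Rewrite X w1 S qi ak w2 Y into X w1 al S qj w2 Y.

    `rule` is the tuple (qi, ak, al, qj). Returns None if the rule does
    not match the state and scanned symbol that follow S.
    """
    qi, ak, al, qj = rule
    s = molecule.index("S")
    if molecule[s + 1:s + 3] != [qi, ak]:
        return None
    w1 = molecule[1:s]
    w2 = molecule[s + 3:-1]
    if not w2:
        # Tape extension: an empty w2 is first replaced by one blank.
        w2 = [blank]
    return ["X"] + w1 + [al, "S", qj] + w2 + ["Y"]


print(apply_right_move(["X", "a", "S", "q1", "b", "c", "Y"], ("q1", "b", "d", "q2")))
```

The branch for an empty `w2` mirrors the remark above that the tape is extended to the right before the move is simulated.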

For the left-shift rules things are similar, except that we ensure that there
are at least two symbols to the left of $S$.

We see that we model, step by step, the application of the rules of $M$. It is easy
to observe that this is the only way for the computation to proceed, as the
molecule which encodes the configuration of $M$ triggers the application of the rules
of $\Gamma$.

Theorem 4. For any recursively enumerable language $\mathcal{L} \subseteq T^*$ there is a
communicating distributed H system with alternating filters with two components and
two filters, $\Gamma = (V, T, (A_1, R_1, F_1^{(1)}, F_1^{(2)}), (A_2, R_2, F_2^{(1)}, F_2^{(2)}))$, which generates
$\mathcal{L}$, i.e., $\mathcal{L} = L(\Gamma)$.

Proof. For the proof we use ideas similar to the ones used in [10].

Let $\mathcal{L} = \{w_0, w_1, \ldots\}$. Then there is a Turing machine $T_{\mathcal{L}}$ which computes
$w_i0\ldots0$ starting from $01^{i+1}$, where 0 is the blank symbol.

It is possible to construct such a machine in the following way. Let G be a 
grammar which generates C. Then we can construct a Turing machine M which 
will simulate derivations in G, i.e., we shall obtain all sentential forms of G. We 
shall simulate bounded derivation in order to deal with a possible recursion in 
a derivation. After deriving a new word the machine checks if this word is a 
terminal word and, if this is the case, it checks if this word is the $i$-th terminal
word which is obtained. If the last condition holds, then the machine erases 
everything except the word. 
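
One possible reading of this enumeration is sketched below in Python. It is an illustration under stated assumptions, not the machine from the proof: grammar rules are assumed to be pairs of strings, sentential forms are explored breadth-first, and the parameter `max_forms` (a hypothetical bound) plays the role of the bounded derivation. The names `terminal_words` and `ith_word` are likewise hypothetical.

```python
from collections import deque
from itertools import islice


def terminal_words(rules, start, terminals, max_forms=10_000):
    """Yield terminal words of the grammar in breadth-first order."""
    seen = {start}
    queue = deque([start])
    explored = 0
    while queue and explored < max_forms:
        form = queue.popleft()
        explored += 1
        if all(ch in terminals for ch in form):
            yield form
        # Apply every rule at every position where its left side occurs.
        for lhs, rhs in rules:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)


def ith_word(rules, start, terminals, i):
    """Return the i-th terminal word found (counting from 0), or None."""
    return next(islice(terminal_words(rules, start, terminals), i, None), None)


# Example grammar S -> aSb | ab.
rules = [("S", "aSb"), ("S", "ab")]
first_three = list(islice(terminal_words(rules, "S", {"a", "b"}), 3))
```

The set `seen` keeps the search from revisiting a sentential form, which is one simple way to cope with recursion in derivations.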

Moreover, this machine is not stationary (i.e., the head has to move at each 
step) and it never visits cells to the left of the 0 from the initial configuration 
(i.e., the tape is semi-infinite to the right). Ideas for doing this can be found 
in [8]. So, machine $T_{\mathcal{L}}$ transforms the configuration $q_001^{k+1}$ into $q_fw_k0\ldots0$.

Now, starting from $T_{\mathcal{L}}$, it is possible to construct a machine $T'_{\mathcal{L}}$ which computes
$01^{k+2}Mq'_fw_k0\ldots0$ starting from $q_001^{k+1}$. Now, using $\Gamma$, we shall cut off $w_k$ and
pass to the configuration $q_001^{k+2}$. In this way we will generate $\mathcal{L}$ word by word.

In order to simplify the proof we shall consider a machine $T''_{\mathcal{L}}$ which does the
same thing as $T'_{\mathcal{L}}$ but at the end moves its head to the right until the first 0,
i.e., it produces $01^{k+2}Mw_kq''_f0\ldots0$ from $q_001^{k+1}$. We also use two additional
Turing machines $T_1$ and $T_2$ whose tape alphabet is that of $T''_{\mathcal{L}}$ extended by the set
$\{X, Y, S\}$. The first machine, $T_1$, moves to the left until it reaches $M$ and then
stops in the state $q_f^1$. The starting state of $T_1$ is $q_s^1$. The second machine, $T_2$,
moves to the left until it reaches 0 and then stops in the state $q_0$. The starting
state of this machine is $q_s^2$.

Now we consider $\Gamma$, which is very similar to the system constructed in the
previous lemma. In fact, we take machine $T''_{\mathcal{L}}$ and we construct $\Gamma$ for this machine
as was done before. After that we take the rules of $T_1$ and $T_2$ and we add
the corresponding splicing rules and axioms to $\Gamma$. Finally we add the following
splicing rules:



[Displayed splicing rules x.1, x.2 and x.2'; the display is illegible in the source.]

where $e \in \{0, 1\}$.

We also add $XSq_001Y$ to the axioms in the first tube and we take $T$ as the
terminal alphabet.

We claim that $\Gamma$ can generate $\mathcal{L}$. First of all, it simulates machine $T''_{\mathcal{L}}$, so
it allows us to pass from the configuration $XSq_001^{k+1}Y$ to $X01^{k+2}Mw_kSq''_f0\ldots0Y$.

Now we can use rule x.1. As a result, the part $0\ldots0Y$ is cut off and
sent to the second tube, and a new molecule $X01^{k+2}Mw'_kSq_s^1ab$ (where $w_k = w'_kab$)
is obtained. We can easily observe that we can simulate the work
of $T_1$, because the only condition which must be satisfied is the presence of one
symbol to the right of the head position. So at this moment $\Gamma$ will simulate
the work of $T_1$ and we will arrive at the following molecule: $X01^{k+2}Sq_f^1Mw_k$.
Now by rules x.2 and x.2' we cut off $w_k$, which represents a result, and
we obtain the molecule $X01^{k+1}Sq_s^21Y$. In a similar way we can simulate
the work of $T_2$ and we arrive at the configuration $XSq_001^{k+2}Y$. So we have shown
that we have implemented the mechanism for generating words which we
presented before.

6 Conclusions 

We presented a new variant of communicating distributed H systems which uses a
tuple of filters instead of one single filter. We showed that using two components
it is possible to generate any recursively enumerable language. We presented four
systems which differ in the number of filters being used. In the first system we
use two filters and, additionally, both filters for the first tube coincide. The second
system uses four filters, but does not contain any rules in the second component,
which serves for garbage collection. We showed that it is possible to reduce the
number of filters down to three. The last system contains two filters, but does not
contain any rule in the second component, and the two filters in each tuple differ
only in one letter. In the first three systems we simulate a Chomsky grammar
and in the last system we simulate a Turing machine.

It is an open problem whether or not it is possible to have no rules in the
second component and only one tuple containing different filters (as in the case
of the first system). It would also be interesting to simulate Chomsky grammars instead
of Turing machines in the case of the last system.